Definition 0.1 (Asymptotic notation)
Given functions or sequences $f, g > 0$ (usually of some parameter $n \to \infty$), the notations in each bullet point below are considered equivalent:

• $f \lesssim g$, $f = O(g)$, $g = \Omega(f)$, $f \le Cg$ (for some constant $C$);
• $f \ll g$, $f = o(g)$, $f/g \to 0$, $g = \omega(f)$;
• $f \asymp g$, $f = \Theta(g)$, $g \lesssim f \lesssim g$;
• $f \sim g$, $f/g \to 1$, $f = (1 + o(1))g$.

Some event holds with high probability if its probability is $1 - o(1)$.
Warning: analytic number theorists like to use the Vinogradov notation, where $f \ll g$ means $f = O(g)$ instead of $f = o(g)$ as we do. In particular, $100 \ll 1$ is correct in Vinogradov notation.


1 Introduction to the probabilistic method 


In combinatorics and other fields of math, we often wish to show existence of some mathematical object. One clever 


way to do this is to try to construct this object randomly and then show that we succeed with positive probability. 


Proposition 1.1 


Every graph $G = (V, E)$ with vertices $V$ and edges $E$ contains a bipartite subgraph with at least $|E|/2$ edges.


Proof. We can form a bipartite graph by partitioning the vertices into two groups. Randomly color each vertex either white or black (making the white and black sets the two groups), and include only the edges between a white and a black vertex in a new graph $G'$. Since all vertices are colored independently and uniformly at random, each edge is included in $G'$ with probability $\frac{1}{2}$. Thus, $G'$ has $|E|/2$ edges on average by linearity of expectation, and this means that at least one coloring yields at least $|E|/2$ edges.
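The averaging argument can be checked exhaustively on a small graph. The sketch below enumerates every 2-coloring of a hypothetical example graph (a 5-cycle plus one chord, chosen only for illustration) and confirms that the average number of white-black edges is exactly $|E|/2$, so the best coloring does at least that well.

```python
from itertools import product

def cut_sizes(num_vertices, edges):
    """Return the number of white-black edges for every 2-coloring of the vertices."""
    sizes = []
    for coloring in product((0, 1), repeat=num_vertices):
        sizes.append(sum(1 for u, v in edges if coloring[u] != coloring[v]))
    return sizes

# Hypothetical example graph: a 5-cycle plus one chord.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (0, 2)]
sizes = cut_sizes(5, edges)
average = sum(sizes) / len(sizes)   # equals |E|/2 by linearity of expectation
best = max(sizes)                   # so some coloring reaches at least |E|/2
print(average, best)
```

Brute force is only feasible for small graphs; the point of the proof is that no search is needed to know a good coloring exists.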


This class will introduce a variety of methods to solve these types of problems, and we'll start with a survey of 


those techniques. 


1.1 The Ramsey numbers 


Definition 1.2 
Let the Ramsey number $R(k, \ell)$ be the smallest $n$ such that if we color the edges of $K_n$ (the complete graph on $n$ vertices) red or blue, we always have a $K_k$ that is all red or a $K_\ell$ that is all blue.


Theorem 1.3 (Ramsey, 1929)
For any integers $k, \ell$, $R(k, \ell)$ is finite.


One way to prove this is to use the recurrence inequality

$$R(r, s) \le R(r-1, s) + R(r, s-1),$$

which follows by picking an arbitrary vertex $v$ and partitioning the remaining vertices by the color of their edge to $v$.
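To see what the recurrence gives numerically, here is a minimal sketch that unrolls it. The base case $R(1, s) = R(r, 1) = 1$ is a convention assumed here (a single vertex is trivially monochromatic), and the function name `ramsey_upper` is hypothetical.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def ramsey_upper(r, s):
    """Upper bound on R(r, s) obtained from R(r,s) <= R(r-1,s) + R(r,s-1)."""
    if r == 1 or s == 1:
        return 1  # assumed base case: K_1 has no edges, so it is monochromatic
    return ramsey_upper(r - 1, s) + ramsey_upper(r, s - 1)

# The recurrence is Pascal's rule, so the bound is a binomial coefficient.
for r in range(1, 7):
    for s in range(1, 7):
        assert ramsey_upper(r, s) == comb(r + s - 2, r - 1)

print(ramsey_upper(3, 3), ramsey_upper(4, 4))
```

With this base case the bound is $\binom{r+s-2}{r-1}$, which is finite for all $r, s$.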


Theorem 1.4 (Erdős, 1947)
We have $R(k, k) > n$ for all $n$ with

$$\binom{n}{k} 2^{1 - \binom{k}{2}} < 1.$$

In other words, for any $n$ that satisfies this inequality, we can color $K_n$ with no monochromatic $K_k$.


Proof. Color the edges of $K_n$ randomly. Given any set $R$ of $k$ vertices, let $A_R$ be the event that $R$ is monochromatic (all $\binom{k}{2}$ edges are the same color). The probability that $A_R$ occurs for any given $R$ is $2^{1-\binom{k}{2}}$, since only 2 of the $2^{\binom{k}{2}}$ colorings of the edges within $R$ are monochromatic, and thus the total probability that $K_n$ contains a monochromatic $K_k$ is

$$\Pr\Big[\bigcup_{R \in \binom{[n]}{k}} A_R\Big],$$

and we can “union bound” this: the total probability is at most the sum of the probabilities of the individual events (no independence is needed), so

$$\Pr(\text{monochromatic } K_k) \le \sum_{R} \Pr(A_R) = \binom{n}{k} 2^{1-\binom{k}{2}},$$

and as long as this is less than 1, there is a positive probability that the coloring has no monochromatic $K_k$, and thus $R(k,k) > n$.
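For concrete values of $k$, the condition of Theorem 1.4 can be evaluated with exact integer arithmetic. This is a small sketch (the helper name `union_bound_n` is hypothetical) that finds the largest $n$ for which the union bound certifies $R(k, k) > n$.

```python
from math import comb

def union_bound_n(k):
    """Largest n with C(n, k) * 2^(1 - C(k, 2)) < 1, so that R(k, k) > n."""
    threshold = 2 ** comb(k, 2)   # the condition is equivalent to 2*C(n,k) < 2^C(k,2)
    n = k - 1                     # C(k-1, k) = 0, so the condition holds trivially here
    while 2 * comb(n + 1, k) < threshold:
        n += 1
    return n

for k in range(3, 11):
    print(k, union_bound_n(k))
```

For small $k$ the resulting bounds are weak (for example, $k = 4$ gives only $R(4,4) > 6$); the argument is interesting because of its exponential growth rate in $k$.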


Fact 1.5 


We can optimize Theorem 1.4 with Stirling’s formula to find that

$$R(k, k) > \left(\frac{1}{e\sqrt{2}} + o(1)\right) k\, 2^{k/2},$$

where the $o(1)$ term goes to 0 as $k \to \infty$.


This is a lower bound on the Ramsey numbers. It turns out we can also get an upper bound

$$R(s, s) \le \left(\frac{1}{4\sqrt{\pi}} + o(1)\right) \frac{4^s}{\sqrt{s}}.$$

Currently, this is basically the best we can do: it is still an open problem to make the bases of the exponents tighter than $\sqrt{2}$ and 4.


Remark. Because the name is Hungarian, the “s” in Erdős is pronounced as “sh,” while “sz” is actually pronounced “s.”


1.2 Alterations 


We can almost immediately improve our previous bound by a bit. 


Proposition 1.6 


For all $k, n$, we have

$$R(k, k) > n - \binom{n}{k} 2^{1-\binom{k}{2}}.$$


Proof. As before, color the edges of $K_n$ randomly. This time, let $A_R$ be the indicator variable for a set $R$ of $k$ vertices. (This means that $A_R$ is equal to 1 if $R$ is monochromatic and 0 otherwise.) The expected value of each $A_R$ is just the probability that $R$ is monochromatic, which is $2^{1-\binom{k}{2}}$, so the expected number $X$ of monochromatic $K_k$s is the sum of the expectations of all $A_R$s, which is

$$\mathbb{E}[X] = \binom{n}{k} 2^{1-\binom{k}{2}}.$$

Now delete one vertex from each monochromatic $k$-clique: we delete at most $X$ vertices (possibly with repeats), so now we have an expected

$$n - \binom{n}{k} 2^{1-\binom{k}{2}}$$

vertices at least. But the remaining coloring has all monochromatic $k$-cliques removed, and thus there exists a coloring of a complete graph with at least this many vertices and no monochromatic $k$-clique.
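The alteration bound has a free parameter $n$, so in practice one maximizes over it. A minimal sketch with exact rational arithmetic follows; the name `alteration_bound` is hypothetical, and the stopping rule relies on the observation that $f(n+1) - f(n) = 1 - \binom{n}{k-1} 2^{1-\binom{k}{2}}$ is nonincreasing in $n$.

```python
from fractions import Fraction
from math import comb

def alteration_bound(k):
    """Return (n, f(n)) maximizing f(n) = n - C(n, k) * 2^(1 - C(k, 2)) over n >= k."""
    scale = Fraction(2, 2 ** comb(k, 2))

    def f(n):
        return n - comb(n, k) * scale

    n = k
    while f(n + 1) > f(n):   # f is concave in n, so stop at the first non-increase
        n += 1
    return n, f(n)

for k in range(4, 11):
    n, value = alteration_bound(k)
    print(k, n, float(value))
```

The returned value $f(n)$ is a strict lower bound on $R(k, k)$ by Proposition 1.6.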


Fact 1.7 


Using the same optimization with Stirling's formula on Proposition 1.6,

$$R(k, k) > \left(\frac{1}{e} + o(1)\right) k\, 2^{k/2},$$

which is better than the result above by a factor of $\sqrt{2}$.


Both of these proofs are interesting, because although we now know a graph exists, we can't actually construct 


such an example easily! 


1.3 Lovász Local Lemma


We're going to discuss some methods in this class beyond just picking things randomly: here's one of them. Let's say that we are trying to avoid a bunch of bad events $E_1, E_2, \dots, E_n$ simultaneously. There are two main situations in which we know how to avoid them:


• All the probabilities are small, and there aren't too many of them. In particular, if the total sum of probabilities is strictly less than 1, we always have a positive chance of success (this is the union bound).

• If all the events are independent, then the probability of success is just the product of the individual avoidance probabilities $1 - \Pr(E_i)$.


Theorem 1.8 (Lovász Local Lemma)
Let $E_1, \dots, E_n$ be events each with probability at most $p$, where each event $E_i$ is mutually independent of all other $E_j$s except at most $d$ of them. If $ep(d+1) \le 1$, then there is a positive probability that no $E_i$ occurs.


Corollary 1.9 (Spencer, 1975)
We have $R(k, k) > n$ if

$$e \left(\binom{k}{2}\binom{n}{k-2} + 1\right) 2^{1-\binom{k}{2}} < 1.$$


Proof. Randomly color all the edges, and again let $A_R$ be the event that a subset $R$ of $k$ vertices forms a monochromatic clique. Note that $A_R$ is mutually independent of all the $A_S$ except those that share an edge with it, meaning $|R \cap S| \ge 2$. For each given $R$, there are at most $\binom{k}{2}\binom{n}{k-2}$ choices for $S$, since we pick 2 vertices to share with $R$ and then pick the rest however we'd like. Now by the Lovász Local Lemma with $p = 2^{1-\binom{k}{2}}$ and $d = \binom{k}{2}\binom{n}{k-2}$, we have a positive probability that no $A_R$ occurs as long as

$$ep(d+1) = e \left(\binom{k}{2}\binom{n}{k-2} + 1\right) 2^{1-\binom{k}{2}} < 1.$$
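The Local Lemma condition can also be evaluated numerically. The sketch below uses floating point for the factor $e$, so it is an approximation rather than an exact check; the function names are hypothetical, and `None` is returned when the condition already fails at $n = k$ (which happens for small $k$).

```python
from math import comb, e

def lll_condition(n, k):
    """Check e * p * (d + 1) < 1 with p = 2^(1 - C(k,2)) and d = C(k,2) * C(n, k-2)."""
    d = comb(k, 2) * comb(n, k - 2)   # upper bound on the number of dependent events
    p = 2.0 ** (1 - comb(k, 2))
    return e * p * (d + 1) < 1

def lll_bound_n(k):
    """Largest n >= k satisfying the condition (so R(k, k) > n), or None if none does."""
    if not lll_condition(k, k):
        return None
    n = k
    while lll_condition(n + 1, k):    # the left side is nondecreasing in n
        n += 1
    return n

for k in (4, 8, 10, 12):
    print(k, lll_bound_n(k))
```

The improvement over the union bound is only a constant factor asymptotically, so it becomes visible only for larger $k$.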


Fact 1.10 
This time, optimizing $n$ in Corollary 1.9 yields

$$R(k, k) > \left(\frac{\sqrt{2}}{e} + o(1)\right) k\, 2^{k/2}.$$


1.4 Set systems 


Let $\mathcal{F}$ be a collection of subsets of $[n] = \{1, 2, \dots, n\}$ (there are a total of $2^n$ subsets to put in $\mathcal{F}$). We call this an antichain if there is no set in $\mathcal{F}$ that is contained in another one.

Our question: what is the largest possible antichain? One thing we can do is to only use subsets of a fixed size $k$, since then no set can be contained in another. This means we can at least get $\binom{n}{\lfloor n/2 \rfloor}$, the largest binomial coefficient. It turns out that this is the best bound:


Theorem 1.11 (Sperner, 1928) 


If $\mathcal{F}$ is an antichain of subsets of $[n]$, then it has size at most $\binom{n}{\lfloor n/2 \rfloor}$.


To show this, we'll prove a slightly more general result:


Theorem 1.12 


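For any antichain $\mathcal{F}$ of the subsets of $[n]$,

$$\sum_{A \in \mathcal{F}} \binom{n}{|A|}^{-1} \le 1.$$

This implies the result above, because it is a weighted sum where each binomial coefficient $\binom{n}{|A|}$ is at most $\binom{n}{\lfloor n/2 \rfloor}$ (the central binomial coefficients are largest), so each term is at least $\binom{n}{\lfloor n/2 \rfloor}^{-1}$.

Proof. Fix a random permutation $\sigma$ of $[n]$. Associated with this permutation, we have a chain

$$\emptyset \subsetneq \{\sigma(1)\} \subsetneq \{\sigma(1), \sigma(2)\} \subsetneq \cdots \subsetneq \{\sigma(1), \dots, \sigma(n)\} = [n].$$

Each subset $A$ has probability $P_A = \binom{n}{|A|}^{-1}$ of appearing in such a chain, since each $|A|$-element subset has the same chance of appearing. However, no two sets of $\mathcal{F}$ can appear in the same chain, so the events are disjoint. Thus, the sum of the probabilities that $A$ appears in the chain must be at most 1, and thus

$$\sum_{A \in \mathcal{F}} P_A = \sum_{A \in \mathcal{F}} \binom{n}{|A|}^{-1} \le 1.$$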
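As a concrete illustration of the weighted sum in Theorem 1.12, the following sketch evaluates it on two small antichains in $[4]$. Both families are hypothetical examples chosen for illustration: the middle layer, where the sum equals 1, and a mixed-size antichain, where it is strictly smaller.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

def is_antichain(family):
    """True if no set in the family is a proper subset of another."""
    return all(not (a < b or b < a) for a, b in combinations(family, 2))

def lym_sum(family, n):
    """Sum of 1 / C(n, |A|) over the sets A in the family."""
    return sum(Fraction(1, comb(n, len(a))) for a in family)

n = 4
middle_layer = [frozenset(c) for c in combinations(range(1, n + 1), 2)]
mixed = [frozenset({1}), frozenset({2, 3}), frozenset({2, 4}), frozenset({3, 4})]

for family in (middle_layer, mixed):
    assert is_antichain(family)
    print(len(family), lym_sum(family, n))
```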


Theorem 1.13 (Bollobás’ Two Families Theorem)

Given $r$-element sets $A_1, \dots, A_m$ and $s$-element sets $B_1, \dots, B_m$, if we know that

$$A_i \cap B_j = \emptyset \quad \text{if and only if} \quad i = j$$

(all $A_i$ and $B_j$ intersect except for $i = j$), then $m \le \binom{r+s}{r}$.


Where's the motivation for this coming from? 


Definition 1.14 
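Given a family of sets $\mathcal{F}$, let a transversal $T$ be a set that intersects all $S \in \mathcal{F}$, and let the transversal number $\tau(\mathcal{F})$ denote the size of the smallest transversal of $\mathcal{F}$. $\mathcal{F}$ is $\tau$-critical if we have $\tau(\mathcal{F} \setminus \{S\}) < \tau(\mathcal{F})$ for all $S \in \mathcal{F}$.


Corollary 1.15 (of Theorem 1.13)

An $r$-uniform $\tau$-critical family of sets $\mathcal{F}$ with $\tau(\mathcal{F}) = s + 1$ has size at most $\binom{r+s}{r}$.


Proof. Let our family of sets be $A_1, \dots, A_m$. $\mathcal{F}$ being $\tau$-critical implies that for any $i$, we can find a transversal of size $s$ for $\mathcal{F} \setminus \{A_i\}$. Letting this be $B_i$, notice that $B_i$ intersects every $A_j$ with $j \ne i$ but cannot intersect $A_i$ (otherwise it would be a transversal of $\mathcal{F}$ of size $s$). So $A_i \cap B_j = \emptyset \iff i = j$, and thus by Bollobás’ Theorem we get the upper bound stated.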


Here's a slightly more general version of Bollobás’ Theorem, which we'll prove now:


Theorem 1.16
Let $A_1, \dots, A_m, B_1, \dots, B_m$ be finite sets, such that $A_i \cap B_j = \emptyset$ if and only if $i = j$. Then

$$\sum_{i=1}^{m} \binom{|A_i| + |B_i|}{|A_i|}^{-1} \le 1.$$

Notice that if we make $B_i = [n] \setminus A_i$ for all $i$, we get Theorem 1.12 (and hence Sperner’s theorem). Meanwhile, if all $A_i$s have size $r$ and all $B_i$s have size $s$, we get Bollobás’ Two Families Theorem.


Proof. Like in Sperner’s theorem, randomly order all elements in the union of all $A_i$s and $B_i$s. For any $i$, the probability that all of $A_i$ occurs before all of $B_i$ is $\binom{|A_i| + |B_i|}{|A_i|}^{-1}$, and we can’t have this happen with two different indices $i \ne j$ in any given ordering, because this would mean that either $A_i$ and $B_j$ are disjoint or $A_j$ and $B_i$ are disjoint. Thus all events of this form are disjoint, and we must have $\sum_{i=1}^{m} \binom{|A_i| + |B_i|}{|A_i|}^{-1} \le 1$, as desired.


Definition 1.17 


A family of sets $\mathcal{F}$ is intersecting if $A \cap B \ne \emptyset$ for all $A, B \in \mathcal{F}$.


Note that this does not mean they must all have a common element: for example, {{1,2}, {1,3}, {2, 3}} is 


intersecting. 


Theorem 1.18 (Erdős-Ko-Rado, 1961)

If $n \ge 2k$, then all intersecting families of $k$-element subsets of $[n] = \{1, 2, \dots, n\}$ have size at most $\binom{n-1}{k-1}$.

(This can be achieved by having all sets share the element 1, for example.)


Proof. Let $\mathcal{F}$ be such a family. Order the integers $1, 2, \dots, n$ around a circle randomly. Let a subset $A \subseteq [n]$ be contiguous if all elements lie in a contiguous block around the circle. For any subset $A$ with $|A| = k$, the probability it is contiguous is

$$\frac{n}{\binom{n}{k}}$$

(think of picking $k$ of the $n$ spots around the circle: $n$ of the $\binom{n}{k}$ choices form a block). So the expected number of contiguous subsets in $\mathcal{F}$ is $|\mathcal{F}| \cdot n / \binom{n}{k}$. But if all subsets are intersecting, at most $k$ of them can be contiguous (this is where we use $n \ge 2k$, and it is why we set up the problem this way). Thus $|\mathcal{F}| \cdot n / \binom{n}{k} \le k$, and rearranging yields

$$|\mathcal{F}| \le \frac{k}{n}\binom{n}{k} = \binom{n-1}{k-1},$$

as desired.
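For very small parameters, the Erdős-Ko-Rado bound can be confirmed by brute force. This is only a sanity-check sketch (the function name is hypothetical): it enumerates every family of $k$-subsets of an $n$-element ground set, so it is exponential in $\binom{n}{k}$.

```python
from itertools import combinations
from math import comb

def max_intersecting_family(n, k):
    """Size of the largest intersecting family of k-subsets of an n-set, by brute force."""
    sets = [frozenset(c) for c in combinations(range(n), k)]
    best = 0
    for mask in range(1 << len(sets)):
        family = [s for i, s in enumerate(sets) if (mask >> i) & 1]
        # Only test families that would improve on the current best.
        if len(family) > best and all(a & b for a, b in combinations(family, 2)):
            best = len(family)
    return best

for n, k in [(4, 2), (5, 2)]:
    print(n, k, max_intersecting_family(n, k), comb(n - 1, k - 1))
```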


1.5 Hypergraph colorings 


This is a topic we'll be discussing quite a bit in this class, but the idea is very similar to that of set systems. 


Definition 1.19 
A $k$-uniform hypergraph $H = (V, E)$ has a (finite) set of vertices $V$ and a set of edges $E$, each of which is a $k$-element subset of $V$. $H$ is $r$-colorable if we can color $V$ with $r$ colors such that no edge is monochromatic (that is, not all the vertices in an edge have the same color).


(Ordinary graphs are 2-uniform hypergraphs.) Let $m(k)$ be the minimum number of edges in a $k$-uniform hypergraph that isn’t 2-colorable.


Example 1.20 
A triangle is not 2-colorable, so m(2) = 3. The Fano plane is not 2-colorable if we interpret lines as edges, so 


m(3) =7 (any smaller example can be checked). 


These quickly become hard to calculate, though: m(4) = 23, but m(5) is actually currently unknown. 
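The claim about the Fano plane can be verified directly by trying all $2^7$ colorings. The labeling of the seven lines below is one standard choice (an assumption of this sketch: any labeling in which every pair of points lies on exactly one line would do).

```python
from itertools import product

# One labeling of the Fano plane: 7 points, 7 lines, every pair on exactly one line.
FANO_LINES = [(0, 1, 2), (0, 3, 4), (0, 5, 6), (1, 3, 5), (1, 4, 6), (2, 3, 6), (2, 4, 5)]

def is_two_colorable(num_vertices, edges):
    """True if some 2-coloring of the vertices leaves no edge monochromatic."""
    for coloring in product((0, 1), repeat=num_vertices):
        if all(len({coloring[v] for v in edge}) > 1 for edge in edges):
            return True
    return False

print(is_two_colorable(7, FANO_LINES))        # all seven lines
print(is_two_colorable(7, FANO_LINES[:6]))    # one line removed
```

Removing any single line makes the hypergraph 2-colorable, consistent with $m(3) = 7$.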


Theorem 1.21 


A $k$-uniform hypergraph with fewer than $2^{k-1}$ edges is 2-colorable.


Proof. Color each vertex randomly; each edge has probability $2^{1-k}$ of being monochromatic, since all $k$ vertices need to be one color or the other. Thus, if we have fewer than $2^{k-1}$ edges, the expected number of monochromatic edges is less than 1, so there is a way to 2-color the hypergraph successfully.


To date, we have the bounds (which are reasonably close to each other)

$$m(k) \gtrsim 2^k \sqrt{\frac{k}{\log k}} \quad \text{and} \quad m(k) = O(k^2 2^k).$$


How do we show the upper bound? We can restate it as follows:


Problem 1.22

Construct a $k$-uniform hypergraph with $O(k^2 2^k)$ edges that is not 2-colorable.


Solution. Start with a set of vertices $V$ where $|V| = n$, and let $H$ be the hypergraph constructed by choosing $m$ edges $S_1, S_2, \dots, S_m$ at random. For any coloring of the vertices $\chi : V \to \{\text{red}, \text{blue}\}$, let $A_\chi$ be the event that $H$ contains no monochromatic edge under $\chi$. Then our goal is to pick $m, n$ so that

$$\sum_{\chi} \Pr(A_\chi) < 1,$$

because this means there is a hypergraph $H$ that cannot be properly colored regardless of which $\chi$ we pick.

A coloring $\chi$ that colors $a$ vertices red and $b$ vertices blue has a given $S_i$ monochromatic with probability

$$\frac{\binom{a}{k} + \binom{b}{k}}{\binom{n}{k}} \ge \frac{2\binom{n/2}{k}}{\binom{n}{k}}$$

(since there are $\binom{n}{k}$ total sets of $k$ vertices and $\binom{a}{k} + \binom{b}{k}$ of them are monochromatic; the inequality holds by convexity). Further bounding, this is

$$\frac{2\binom{n/2}{k}}{\binom{n}{k}} = 2 \prod_{i=0}^{k-1} \frac{n/2 - i}{n - i} \ge 2\left(\frac{n/2 - k + 1}{n - k + 1}\right)^k = 2^{1-k}\left(1 - \frac{k-1}{n-k+1}\right)^k \ge c\, 2^{-k},$$

where we pick $n = k^2$ so that we can have

$$2\left(1 - \frac{k-1}{n-k+1}\right)^k \ge c,$$

a constant. So now the probability that we have a proper coloring (which means no $S_i$ is monochromatic) is at most (looking at all $S_i$s now)

$$\Pr(A_\chi) \le \left(1 - c\, 2^{-k}\right)^m \le e^{-c 2^{-k} m},$$

since we chose our $S_i$s randomly and independently (possibly with repeats), and $1 + x \le e^x$ for all $x$. Therefore, if we sum over all $\chi$, we have

$$\sum_{\chi} \Pr(A_\chi) \le 2^n e^{-c 2^{-k} m} = 2^{k^2} e^{-c 2^{-k} m} < 1$$

for some value of $m = O(k^2 2^k)$, as desired.


Now that we have a sampling of some preliminary techniques, we'll begin examining them in more detail in the next 


few chapters! 
2 Linearity of expectation


2.1 Setup and basic examples 


Often, a random variable $X$ can be written as

$$X = c_1 X_1 + c_2 X_2 + \cdots + c_n X_n,$$

where the $c_i$ are constants and the $X_i$ are random variables, not necessarily independent. In these cases, we know that

$$\mathbb{E}[X] = c_1 \mathbb{E}[X_1] + \cdots + c_n \mathbb{E}[X_n].$$

However, it is not necessarily true that $\mathbb{E}[XY] = \mathbb{E}[X]\mathbb{E}[Y]$.


Example 2.1 


Given a random permutation of [n], how many fixed points do we expect it to have? 


Solution. Let $A_i$ be the indicator variable for $i$ being a fixed point: $\sigma(i) = i$. Since $i$ is a fixed point with probability $\frac{1}{n}$, the expected value of $A_i$ is $\frac{1}{n}$, so the expected number of overall fixed points is just $n \cdot \frac{1}{n} = 1$.
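The answer can be confirmed by exhaustive enumeration for small $n$. The sketch below averages the number of fixed points over all $n!$ permutations using exact rational arithmetic.

```python
from fractions import Fraction
from itertools import permutations

def expected_fixed_points(n):
    """Exact average number of fixed points over all permutations of {0, ..., n-1}."""
    perms = list(permutations(range(n)))
    total = sum(sum(1 for i, x in enumerate(p) if i == x) for p in perms)
    return Fraction(total, len(perms))

for n in range(1, 7):
    print(n, expected_fixed_points(n))
```

Note that the indicator variables $A_i$ are not independent, which is exactly why linearity of expectation is the convenient tool here.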


Let’s take a look at a basic graph theory problem: 


Definition 2.2 
A tournament is a complete graph with each edge directed (from one endpoint to the other). A Hamiltonian 


path is a directed path through a graph which passes through all vertices. 


Theorem 2.3 (Szele, 1943) 


For all $n$, there exists a tournament on $n$ vertices with at least $n!\, 2^{-(n-1)}$ Hamiltonian paths.


Proof. Start with $K_n$ and randomly orient each edge. Then for each permutation of the vertices, the probability that the edges are all directed correctly to form a Hamiltonian path in that order is $2^{-(n-1)}$ (since each of the $n - 1$ edges along the path has two equally likely orientations). Thus, by linearity of expectation, the expected number of Hamiltonian paths is $n!\, 2^{-(n-1)}$, and thus there exists a tournament with at least that many Hamiltonian paths.


Alon proved in 1990 that the maximum number is asymptotically of that magnitude: we can have at most $\frac{n!}{(2 - o(1))^n}$ Hamiltonian paths.
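As a sanity check on the expectation in the proof of Theorem 2.3, the sketch below averages the number of Hamiltonian paths over all $2^{\binom{n}{2}}$ tournaments on $n$ vertices. It is feasible only for tiny $n$, and the function name is hypothetical.

```python
from fractions import Fraction
from itertools import combinations, permutations, product

def average_hamiltonian_paths(n):
    """Exact average number of Hamiltonian paths over all tournaments on n vertices."""
    pairs = list(combinations(range(n), 2))
    total = 0
    count = 0
    for bits in product((0, 1), repeat=len(pairs)):
        # arcs contains (u, v) when the edge between u and v is directed u -> v
        arcs = {(u, v) if b else (v, u) for (u, v), b in zip(pairs, bits)}
        total += sum(
            all((p[i], p[i + 1]) in arcs for i in range(n - 1))
            for p in permutations(range(n))
        )
        count += 1
    return Fraction(total, count)

for n in (2, 3, 4):
    print(n, average_hamiltonian_paths(n))
```

For $n = 3$ the average is $3/2$: the six transitive tournaments have one Hamiltonian path each, and the two cyclic ones have three each.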


Let’s now start to look at some more complicated applications. 


2.2 Sum-free sets 


Definition 2.4 


A subset $A$ of an abelian group is sum-free if there are no elements $a, b, c \in A$ with $a + b = c$.


An interesting abelian group to consider is the integers:

Theorem 2.5

Every set of $n$ nonzero integers contains a sum-free subset of size at least $\frac{n}{3}$.


Proof. Let $A$ be a set of nonzero integers with $|A| = n$. Pick a real number $\theta \in [0, 1]$, and let

$$A_\theta = \left\{a \in A : \{a\theta\} \in \left(\tfrac{1}{3}, \tfrac{2}{3}\right)\right\}$$

(in other words, $A_\theta$ contains all elements $a$ for which the fractional part of $a\theta$ is in the middle third). Note that $A_\theta$ is always sum-free, since no two numbers with fractional part in the middle third can add to a third one: their sum has fractional part in $\left(\tfrac{2}{3}, 1\right) \cup \left[0, \tfrac{1}{3}\right)$. Now uniformly pick $\theta$ from 0 to 1: since the probability that any given $a$ is in $A_\theta$ is always $\frac{1}{3}$ (since $a\theta$ ranges uniformly from 0 to $a$), the expected number of elements in $A_\theta$ is $\frac{n}{3}$, and therefore there is some sum-free subset $A_\theta$ with size at least $\frac{n}{3}$, as desired.

The best we can do currently is $\frac{n+2}{3}$, and it’s been shown that $\left(\frac{1}{3} + c\right) n$ is not possible asymptotically for any $c > 0$. However, the constant $c'$ in $\frac{n}{3} + c'$ is still open!
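The proof of Theorem 2.5 can be turned into a small search procedure. The set $A_\theta$ only changes when some $\{a\theta\}$ crosses $\frac{1}{3}$ or $\frac{2}{3}$, so it suffices to try one $\theta$ in each interval between consecutive breakpoints. The sketch below does this with exact fractions; the function names are hypothetical, and the search is practical only when the integers are small in absolute value.

```python
from fractions import Fraction

def in_middle_third(a, theta):
    """True if the fractional part of a * theta lies strictly between 1/3 and 2/3."""
    frac = (a * theta) % 1
    return Fraction(1, 3) < frac < Fraction(2, 3)

def sum_free_subset(A):
    """Return the largest A_theta over one theta from each interval between breakpoints."""
    points = {Fraction(0), Fraction(1)}
    for a in A:
        for m in range(abs(a)):
            points.add(Fraction(3 * m + 1, 3 * abs(a)))   # {|a| * theta} = 1/3
            points.add(Fraction(3 * m + 2, 3 * abs(a)))   # {|a| * theta} = 2/3
    pts = sorted(points)
    best = []
    for lo, hi in zip(pts, pts[1:]):
        theta = (lo + hi) / 2
        subset = [a for a in A if in_middle_third(a, theta)]
        if len(subset) > len(best):
            best = subset
    return best

def is_sum_free(S):
    s = set(S)
    return all(a + b not in s for a in s for b in s)

A = list(range(1, 11))
S = sum_free_subset(A)
print(S, is_sum_free(S))
```

Since the average of $|A_\theta|$ over $\theta \in [0, 1]$ is $n/3$ and $|A_\theta|$ is constant on each interval, the largest set found has size at least $n/3$.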


2.3 Cliques 


Theorem 2.6 (Ramsey multiplicity) 


There exists a 2-coloring of the edges of $K_n$ with a “relatively small number” of monochromatic $t$-cliques: there are at most $2^{1-\binom{t}{2}} \binom{n}{t}$ monochromatic copies of $K_t$.


Proof. Color all the edges randomly. The expected number of monochromatic $K_t$s is, by linearity of expectation,

$$\binom{n}{t} 2^{1-\binom{t}{2}},$$

since each set of $t$ vertices we pick has $\binom{t}{2}$ edges and there are only 2 ways to color them to form a monochromatic $K_t$.

Thus, there is a positive probability that the number of monochromatic $K_t$s is at most this number.


Definition 2.7 


Let $c_t$ be the maximum constant such that every 2-edge coloring of $K_n$ has at least $(c_t + o(1))\binom{n}{t}$ monochromatic $t$-cliques.


In other words, $c_t$ is the best fractional bound on the number of monochromatic $t$-cliques, and we've just found that $c_t \le 2^{1-\binom{t}{2}}$. Can we do better and find a smaller upper bound on $c_t$?

It is known that this is tight for $t = 3$: Goodman's theorem implies that we indeed have $c_3 = \frac{1}{4}$. (Proving this is a good exercise in double counting.) We'd initially suspect that equality can also be achieved for $t = 4$, but it was found by Thomason in 1989 that $c_4 < \frac{1}{33} < \frac{1}{32}$. Likewise, the bound has been shown to be not tight for all $t > 4$. In fact, the exact value of $c_t$ for $t \ge 4$ is still an open problem.

But can we prove any kind of lower bound for $c_t$? Specifically, what techniques do we have for proving positive lower bounds? In other words, we're trying to show that there are a lot of monochromatic $t$-cliques, and that sounds vaguely like Ramsey's theorem. One thing we could do is find a copy, delete a vertex, and repeat, but this gives only a linear number of $t$-cliques out of the $\binom{n}{t} = \Theta(n^t)$ possible ones, which isn't enough for a positive constant. Instead, we'll use the sampling trick!
Theorem 2.8

Every 2-coloring of $K_n$ with $n \ge R(t, t)$ contains at least

$$\binom{R(t,t)}{t}^{-1} \binom{n}{t}$$

monochromatic $K_t$s.


Proof. Suppose there are $M$ monochromatic $K_t$s in our coloring. Let $X$ be a uniformly random $t$-clique: then it has a probability of $M / \binom{n}{t}$ of being monochromatic.

But instead, let's pick the same $X$ in a different way. First, pick a random $R(t, t)$-clique, where $R(t, t)$ is the Ramsey number, and then pick a random $t$-vertex subclique of that. (For this trick to work, we need $n \ge R(t, t)$ so that we are able to pick a random $R(t, t)$-clique.) This second procedure has two random steps, but by Ramsey’s theorem, there is at least one monochromatic $t$-clique inside the clique chosen in the first step! So $X$ is monochromatic with probability at least $\binom{R(t,t)}{t}^{-1}$.

So putting these together,

$$\frac{M}{\binom{n}{t}} \ge \binom{R(t,t)}{t}^{-1}.$$

This is likely far from optimal, but at least it gives us a nonzero lower bound on $c_t$:


Corollary 2.9

For all positive integers $t$,

$$c_t \ge \binom{R(t,t)}{t}^{-1} > 0.$$


2.4 Independent sets 


Let’s turn to a new question: what is the maximum number of edges in an $n$-vertex $K_t$-free graph? Note that cliques in a graph $G$ are the same as independent sets in $\overline{G}$ (the graph's complement), so this is a very similar idea to what we've already been discussing.


Theorem 2.10 (Caro-Wei)

Every graph $G$ contains an independent set $I$ of size

$$|I| \ge \sum_{v \in V(G)} \frac{1}{d(v) + 1},$$

where $d(v)$ is the degree of $v$.


In particular, we should expect large independent sets out of graphs with low degrees, which is convenient for us.


Proof by Alon and Spencer. Consider a random ordering of $V$, and let $I$ be the set of vertices that appear before all of their neighbors in the ordering.
$I$ is an independent set, since no edge can connect two vertices in $I$ (one endpoint comes before the other). How big is $I$? By linearity of expectation,

$$\mathbb{E}[|I|] = \sum_{v \in V} \Pr(v \in I).$$

The probability that a vertex $v$ is in $I$ is $\frac{1}{d(v)+1}$, since there are $d(v) + 1$ total vertices to consider here, $v$ and all of its neighbors, and $v$ must be the one in front. So there is some ordering for which the independent set $I$ has size at least $\sum_{v} \frac{1}{d(v)+1}$.
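The random-ordering argument is easy to check exhaustively on a small graph. The sketch below uses a hypothetical example (a path on 4 vertices), enumerates all orderings, and compares the exact average size of $I$ with the Caro-Wei sum.

```python
from fractions import Fraction
from itertools import permutations

def first_among_neighbors(order, adj):
    """Vertices that appear in the ordering before all of their neighbors."""
    pos = {v: i for i, v in enumerate(order)}
    return [v for v in order if all(pos[v] < pos[u] for u in adj[v])]

# Hypothetical example: the path 0 - 1 - 2 - 3, given as adjacency lists.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

orders = list(permutations(adj))
sizes = [len(first_among_neighbors(order, adj)) for order in orders]
average = Fraction(sum(sizes), len(sizes))
caro_wei = sum(Fraction(1, len(nbrs) + 1) for nbrs in adj.values())
print(average, caro_wei, max(sizes))
```

For this path both quantities equal $\frac{5}{3}$, so some ordering yields an independent set of size at least 2.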
Now, let’s take the complement of Caro-Wei. Independent sets become cliques and vice versa, which yields the following:


Corollary 2.11

Every graph $G$ on $n$ vertices contains a clique of size at least

$$\sum_{v \in V(G)} \frac{1}{\overline{d}(v) + 1} = \sum_{v \in V(G)} \frac{1}{n - d(v)},$$

where $\overline{d}(v) = n - 1 - d(v)$ is the degree of $v$ in the complement.


Note that if we hold the number of edges fixed, so $\sum_v d(v) = 2|E|$, the sum is minimized when the $d(v)$s are as close to each other as possible.
So where's the equality case of Caro-Wei (and the corollary after it)? The degrees should be as equal as possible, which suggests something like the following:


Definition 2.12 
A Turán graph $T_{n,r}$ has $n$ vertices and is a complete $r$-partite graph, such that each part has either $\lfloor n/r \rfloor$ or $\lceil n/r \rceil$ vertices.


Note that this graph is $K_{r+1}$-free, and it turns out this is the extreme example:


Theorem 2.13 (Turán's theorem)

Given a graph $G$ with $n$ vertices that is $K_{r+1}$-free,

$$|E(G)| \le |E(T_{n,r})| \le \left(1 - \frac{1}{r}\right)\frac{n^2}{2},$$

where the second inequality is tight if $r \mid n$.


For simplicity, we'll show a slightly weaker result where we skip the middle part of the inequality. 


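Proof. Since $G$ is $K_{r+1}$-free, by the complement of Caro-Wei,

$$r \ge \sum_{v \in V} \frac{1}{n - d(v)} \ge \frac{n}{n - d}$$

by convexity, where $d$ is the average degree of the vertices. Since the average degree is $\frac{2|E|}{n}$, rearranging gives $|E| \le \left(1 - \frac{1}{r}\right)\frac{n^2}{2}$, which is the result.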


We just have to be a bit more careful in the case where r doesn’t divide n, but it’s not too much more difficult. 
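To make the second inequality in Turán's theorem concrete, here is a short sketch that counts the edges of $T_{n,r}$ from its part sizes and compares with $\left(1 - \frac{1}{r}\right)\frac{n^2}{2}$ using integer arithmetic. The function name `turan_edges` is hypothetical.

```python
def turan_edges(n, r):
    """Number of edges of the complete r-partite graph on n vertices with balanced parts."""
    q, rem = divmod(n, r)
    sizes = [q + 1] * rem + [q] * (r - rem)
    # Every pair of vertices is adjacent unless both lie in the same part.
    return (n * n - sum(s * s for s in sizes)) // 2

for n in range(1, 13):
    for r in range(1, n + 1):
        # |E(T_{n,r})| <= (1 - 1/r) n^2 / 2, written without division
        assert 2 * r * turan_edges(n, r) <= (r - 1) * n * n

print(turan_edges(6, 2), turan_edges(7, 2), turan_edges(9, 3))
```

Equality holds exactly when all parts have the same size, that is, when $r \mid n$.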


2.5 Crossing numbers 


The next example may seem a bit less familiar in terms of the techniques it uses. Given a graph G, we can draw it on 
the plane; it may or may not be planar. A graph is planar if we can draw it in a way such that all edges are continuous 


curves and only intersect at vertices. 


Fact 2.14 (“Common folklore knowledge” and Kuratowski’s theorem)

$K_4$ is planar, but $K_5$ and $K_{3,3}$ are not. It turns out these are the only two minimal examples of nonplanar graphs: any nonplanar graph contains a subgraph that is topologically equivalent to $K_5$ or $K_{3,3}$.

The idea is that if we see a graph with a lot of edges, it should have a lot of crossings. How many such crossings must $K_n$ or $K_{n,n}$ have? In fact, what’s the bound for any $G$ with some large number of edges?
The exact answers for $K_n$ and $K_{n,n}$ are famous open questions, but there are conjectures: they're called Hill's conjecture and the Zarankiewicz conjecture, respectively.


Remark (Historical note). The problem of drawing the complete bipartite graph with the minimum number of crossings is also called Turán's brick factory problem. During World War II, Turán was forced to work in a brick factory pushing wagons of bricks along rail tracks. The wagons are harder to push when the rail tracks cross. This experience inspired Turán to think about how to design the layout of the tracks in order to minimize the number of crossings.


The conjectured optimal drawing for $K_{n,n}$ is to either place points antipodally on a sphere and connect them by geodesics, or put one set on the x-axis and the other on the y-axis. That makes this problem hard: two very different constructions do equally well.


Definition 2.15

The crossing number $\mathrm{cr}(G)$ is the minimum number of crossings in a drawing of $G$ in the plane.


Are there any bounds we can place on this? It seems like we should expect on the order of $n^4$ crossings for a dense graph on $n$ vertices, since any 4 points potentially create a crossing. Is that at least correct up to a constant factor?


We'll start by considering some facts in graph theory: 


Proposition 2.16 (Euler's formula) 


Given a connected planar graph with $V$ vertices, $E$ edges, and $F$ faces,

$$V - E + F = 2.$$

The next few sentences are easy to get wrong, so we're going to be careful.


Proposition 2.17

Every connected planar graph with at least one cycle (not just a tree) has $3|F| \le 2|E|$.


This is true because every face is surrounded by at least 3 edges, and every edge touches at most 2 faces.

Plugging this into Euler’s formula, we also find that $|E| \le 3|V| - 6$ for all connected planar graphs with at least one cycle. There are some graphs that do not satisfy the conditions above, but that’s okay: from similar arguments, we can still deduce that all planar graphs satisfy $|E| \le 3|V|$.

So if there are too many edges, we want to be able to say that there are lots of crossings. Basically, every edge beyond the threshold of $3|V|$ forces a crossing: if we delete one edge per crossing from a drawing with $\mathrm{cr}(G)$ crossings, we get a planar graph. Thus $|E| - \mathrm{cr}(G) \le 3|V|$, or

$$\mathrm{cr}(G) \ge |E| - 3|V|.$$

But this gives only on the order of $n^2$ crossings for an $n$-vertex graph, and we're trying to show that on the order of $n^4$ crossings exist. Here's where the probabilistic method comes in: we're going to sample like we did with the Ramsey number to get a better answer.


Theorem 2.18 (Crossing number inequality)
Given a graph $G$ with $|E| \ge 4|V|$,

$$\mathrm{cr}(G) \gtrsim \frac{|E|^3}{|V|^2}.$$

Proof. Let $p \in [0, 1]$ be a number that we will decide later, and let $G' = (V', E')$ be obtained from $G$ by randomly picking each vertex with probability $p$. In other words, randomly delete each vertex (and the edges connected to it) with probability $1 - p$.
Our graph $G'$ should satisfy

$$\mathrm{cr}(G') \ge |E'| - 3|V'|,$$

and now take expectations of both sides:

$$\mathbb{E}[\mathrm{cr}(G')] \ge \mathbb{E}[|E'|] - 3\,\mathbb{E}[|V'|].$$

If we start with a drawing of $G$ with $\mathrm{cr}(G)$ crossings, each crossing has 4 vertices that contribute to it. This crossing remains with probability $p^4$, but note that after we delete some vertices and edges, we can potentially redraw the diagram to have fewer crossings. So the left hand side has an inequality of the form

$$\mathbb{E}[\mathrm{cr}(G')] \le p^4\, \mathrm{cr}(G).$$

The right hand side is easier:

$$\mathbb{E}[|E'|] = p^2 |E|, \qquad \mathbb{E}[|V'|] = p|V|.$$

Moving the $p^4$ to the other side now, we have a new bound:

$$\mathrm{cr}(G) \ge p^{-2}|E| - 3p^{-3}|V|.$$

From here, we set $p$ so that we have $4p^{-3}|V| = p^{-2}|E|$, that is, $p = 4|V|/|E|$, but note that this only works if $|E| \ge 4|V|$, since our probability needs to be between 0 and 1. This gives the result that we want:

$$\mathrm{cr}(G) \ge p^{-3}|V| = \frac{|E|^3}{64|V|^2}.$$

Notably, if $|V| = n$ and $|E| \gtrsim n^2$ (is quadratic in $n$), then $\mathrm{cr}(G) \gtrsim n^4$: the crossing number is quartic in $n$, as desired!
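The choice $p = 4|V|/|E|$ and the resulting constant can be checked with exact arithmetic. This sketch (the function name is hypothetical) evaluates $p^{-2}|E| - 3p^{-3}|V|$ at that value of $p$; the example numbers correspond to $K_{10}$, which has 10 vertices and 45 edges.

```python
from fractions import Fraction

def crossing_lower_bound(num_vertices, num_edges):
    """Evaluate p^-2 |E| - 3 p^-3 |V| at p = 4|V|/|E| (requires |E| >= 4|V|)."""
    if num_edges < 4 * num_vertices:
        raise ValueError("the choice p = 4|V|/|E| needs |E| >= 4|V|")
    p = Fraction(4 * num_vertices, num_edges)
    return num_edges / p**2 - 3 * num_vertices / p**3

bound = crossing_lower_bound(10, 45)
print(bound, float(bound))
```

The value agrees with $|E|^3 / (64 |V|^2)$; it is a lower bound on the crossing number, not its exact value.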


2.6 Application to incidence geometry 


Problem 2.19 


Given $n$ points and $n$ lines, what’s the maximum number of incidences between them?


Let's formulate this more rigorously:


Definition 2.20
Let $P$ be a set of points and $L$ be a set of lines. Define

$$I(P, L) = \{(p, \ell) \in P \times L : p \in \ell\}$$

to be the set of incidences between a point in $P$ and a line in $L$.


We wish to maximize $|I(P, L)|$.

Example 2.21
Let $P$ be the lattice grid $[k] \times [2k^2]$, and let $L$ be the lines with small integer slope: $L = \{y = mx + b : m \in [k], b \in [k^2]\}$. Then every line in $L$ contains $k$ points of $P$, so

$$|I(P, L)| = k^4,$$

which gives $\Theta(n^{4/3})$ incidences when there are $n = \Theta(k^3)$ points and lines.
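The incidence count in Example 2.21 can be confirmed by brute force for small $k$. The sketch below builds the grid and the lines explicitly and counts point-line incidences; the function name is hypothetical.

```python
def grid_incidences(k):
    """Return (#points, #lines, #incidences) for the grid [k] x [2k^2] and lines y = mx + b."""
    points = {(x, y) for x in range(1, k + 1) for y in range(1, 2 * k * k + 1)}
    lines = [(m, b) for m in range(1, k + 1) for b in range(1, k * k + 1)]
    incidences = sum(
        1
        for (m, b) in lines
        for x in range(1, k + 1)
        if (x, m * x + b) in points
    )
    return len(points), len(lines), incidences

for k in (2, 3, 4):
    print(k, grid_incidences(k))
```

Every line meets the grid in exactly $k$ points because $mx + b \le k \cdot k + k^2 = 2k^2$ for all $x, m \in [k]$ and $b \in [k^2]$.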


The natural question to ask is whether this is optimal, and the answer is yes. To prove this, let's start trying to find some upper bounds. Assume temporarily that every line has at least two incidences: clearly, there is a bound

$$|I(P, L)| \le |P||L|,$$

which is weak as soon as there are at least 2 points or 2 lines. But let’s use the fact that there is at most one line through each pair of points: to do this, we'll double count the number of triples $(p, p', \ell) \in P \times P \times L$ with $p \ne p'$ and $p, p' \in \ell$. On one hand, given two points, we've determined the line, so there are at most $|P|^2$ such triples. On the other hand, if we count the incidences in terms of lines, the number of triples is

$$\sum_{\ell \in L} |P \cap \ell|\,(|P \cap \ell| - 1) \ge \frac{|I(P, L)|^2}{|L|} - |I(P, L)|,$$

where we've done bounding by Cauchy-Schwarz. Putting these together,

$$|I(P, L)| \lesssim |P||L|^{1/2} + |L|.$$

By point-line duality, we can also find an analogous statement if we flip $L$ and $P$. Either way, for $n$ lines and $n$ points, we're getting $O(n^{3/2})$, which is not as strong as $O(n^{4/3})$.


Remark. We can make this bound that we found tight in some situations, though: it turns out this is the right number of incidences over a finite field $\mathbb{F}_q^2$ if we take all $\Theta(q^2)$ lines and all $q^2$ points.


Back to the Euclidean plane. To make the bound tight, we invoke the topology of Euclidean space and the crossing number theorem. Assume, again, that every line has at least 2 incidences. Draw a graph based on the point-line configuration, where the points are vertices and consecutive points on a line form an edge. So each line gets chopped up into some number of segments.

How many edges and vertices are there? The points are vertices, so $|V| = |P|$. A line with $k$ incidences (and $k \ge 2$) has $k - 1 \ge \frac{k}{2}$ edges, so the number of edges is at least

$$|E| \ge \frac{|I(P, L)|}{2}.$$

Two lines can cross at most once, so

$$\mathrm{cr}(G) \le |L|^2.$$

Provided that the number of incidences is at least 8 times the number of points (so that $|E| \ge 4|V|$), we can invoke the crossing number inequality:

$$|L|^2 \ge \mathrm{cr}(G) \gtrsim \frac{|E|^3}{|V|^2} \gtrsim \frac{|I(P, L)|^3}{|P|^2}.$$

Rearranging, this gives us

$$|I(P, L)| \lesssim |P|^{2/3} |L|^{2/3},$$

but this only works if we have a sufficiently large number of incidences, so we need to add a linear $|P|$ term. We also need to correct for the fact that we're assuming that there are at least 2 incidences per line, which adds a linear $|L|$ term:


Theorem 2.22 (Szemerédi-Trotter theorem)

For any set of points and lines,

$$|I(P, L)| \lesssim |P|^{2/3} |L|^{2/3} + |P| + |L|.$$

This is sharp up to constant factors! As a corollary, $n$ points and $n$ lines always have $O(n^{4/3})$ incidences.


2.7 Derandomization: balancing vectors


We'll start by solving a problem with familiar techniques: 


Theorem 2.23

Given unit vectors $v_1, \dots, v_n \in \mathbb{R}^n$, there exist $\varepsilon_1, \varepsilon_2, \dots, \varepsilon_n \in \{-1, 1\}$ such that

$$|\varepsilon_1 v_1 + \cdots + \varepsilon_n v_n| \le \sqrt{n}.$$

This is motivated by considering $v_1, \dots, v_n$ to be a standard basis: our choices can't get the length of the vector any smaller than $\sqrt{n}$. As a sidenote, we can also show that we can pick the $\varepsilon_i$s to make the length at least $\sqrt{n}$.
We want to use linearity of expectation, but we have a small problem: we have an expectation of an absolute value. The easiest way to get around this is to square both sides of our inequality!
Proof. Let

$$X = |\varepsilon_1 v_1 + \cdots + \varepsilon_n v_n|^2,$$

and pick each $\varepsilon_i$ independently and uniformly at random from $\{-1, 1\}$. $X$ expands out to the sum

$$X = \sum_{i,j=1}^{n} \varepsilon_i \varepsilon_j (v_i \cdot v_j),$$

and now that the absolute values are gone, we can just use linearity of expectation: for $i \ne j$, the expectation is 0, and for $i = j$, we get a contribution of $1 \cdot |v_i|^2 = 1$ from each term. So the expected value of $X$ is $n$, so with some positive probability $X \le n$ (and also, with positive probability, $X \ge n$).


We can also do this all deterministically: in this case, we don’t actually have to use the probabilistic method.


Finding the $\varepsilon_i$s algorithmically. We're going to pick our $\varepsilon_i$s sequentially and greedily. At each step, we pick the $\varepsilon_i$ that minimizes the expected value of $X$ conditional on the previous choices.


For example, if we have picked $\varepsilon_1, \dots, \varepsilon_{k-1}$, let $w = \varepsilon_1 v_1 + \cdots + \varepsilon_{k-1} v_{k-1}$. Then our conditional expectation is

$$\mathbb{E}[X \mid \varepsilon_1, \dots, \varepsilon_k] = \mathbb{E}\left[|w + \varepsilon_k v_k + \varepsilon_{k+1} v_{k+1} + \cdots + \varepsilon_n v_n|^2 \mid \varepsilon_1, \dots, \varepsilon_k\right],$$

and expanding out the square again, this becomes

$$|w|^2 + 2\varepsilon_k (w \cdot v_k) + (n - k + 1).$$

To minimize this value, we pick $\varepsilon_k = 1$ if and only if $w \cdot v_k \le 0$.
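The greedy rule translates directly into code. The sketch below keeps the running sum $w$ and picks each sign so that $w \cdot v_k$ contributes nonpositively, which ensures $|w|^2$ grows by at most 1 per step. The function name and the example vectors are hypothetical, and plain floats are used, so comparisons are subject to rounding.

```python
def balance_signs(vectors):
    """Greedily choose signs so the squared norm of the signed sum grows by at most 1 per step."""
    dim = len(vectors[0])
    w = [0.0] * dim
    signs = []
    for v in vectors:
        dot = sum(wi * vi for wi, vi in zip(w, v))
        eps = 1 if dot <= 0 else -1          # makes 2 * eps * (w . v) <= 0
        signs.append(eps)
        w = [wi + eps * vi for wi, vi in zip(w, v)]
    return signs, w

# Hypothetical unit vectors in R^4.
vectors = [
    (1.0, 0.0, 0.0, 0.0),
    (0.6, 0.8, 0.0, 0.0),
    (0.0, 0.6, 0.8, 0.0),
    (0.0, 0.0, 0.6, 0.8),
]
signs, w = balance_signs(vectors)
print(signs, sum(x * x for x in w))
```

Each step costs one dot product, so the whole procedure runs in time linear in the size of the input.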
Why couldn't we do something like this for the Ramsey number proof, too? The idea is that we can’t compute the relevant conditional expectations (the expected number of monochromatic cliques given a partial coloring) easily! (It is “expensive” to do so.) This idea of turning probabilistic proofs into deterministic ones is called derandomization.


2.8 Unbalancing lights 


Problem 2.24 


Consider a grid of $n \times n$ lights, where we only have light switches for each row and column. How can we maximize the number of lightbulbs turned on given some starting configuration?


Represent this as an array of $\pm 1$ numbers. Let $a_{ij} \in \{-1, 1\}$ for all $1 \le i, j \le n$, and let’s say that our light switches are labeled $x_1, \dots, x_n, y_1, \dots, y_n \in \{-1, 1\}$. Our goal is then to maximize the quantity

$$\sum_{i,j=1}^{n} a_{ij} x_i y_j,$$

since only the parity of how many times we flip each switch matters (not even the order).
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Well, there are n° variables, so if we do our probabilistic method naively at random, we can guarantee a linear 


answer inn, since Vn? = n. But we can do better than that: 


Theorem 2.25 
Given fixed $a_{ij} \in \{-1, 1\}$, we can pick $x_1, \dots, x_n, y_1, \dots, y_n \in \{-1, 1\}$ such that

$$\sum_{i,j=1}^{n} a_{ij} x_i y_j \ge \left(\sqrt{\frac{2}{\pi}} + o(1)\right) n^{3/2}.$$


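Proof. Choose $y_1, \dots, y_n \in \{-1, 1\}$ uniformly at random: this means that we pick a random way to flip our columns. Now, for each row, we can choose $x_i$ such that the $i$th row sum is nonnegative (in other words, flip a row if the sum is negative). Each row sum is

$$R_i = \sum_{j=1}^{n} a_{ij} y_j,$$

and our final sum is just $R = \sum_{i=1}^{n} |R_i|$. Here we use linearity of expectation: the expected value of each $|R_i|$ is the same, and each $R_i$ is a sum of $n$ independent uniform $\pm 1$s. This gives a (shifted and scaled) binomial distribution, so we can use the Central Limit Theorem: $R_i / \sqrt{n}$ converges to a standard normal $X$, and

$$\mathbb{E}\left[\frac{|R_i|}{\sqrt{n}}\right] \to \mathbb{E}|X| = \sqrt{\frac{2}{\pi}}.$$

(Alternatively, we can directly compute our quantity

$$\mathbb{E}|R_i| = n 2^{1-n} \binom{n-1}{\lfloor (n-1)/2 \rfloor}$$

and use Stirling's formula.) Regardless, each row has expected value $\left(\sqrt{2/\pi} + o(1)\right) \sqrt{n}$, so $\mathbb{E}[R] = \left(\sqrt{2/\pi} + o(1)\right) n^{3/2}$, and some choice of the $y_j$ achieves at least this value, which is what we want.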
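As a rough numerical companion to this proof, the following Python sketch flips the columns at random, fixes each row greedily, and also computes $\mathbb{E}|R_i|$ exactly by summing over the binomial distribution. The names `unbalance` and `expected_abs_row_sum` are hypothetical and chosen only for this illustration.

```python
import math
import random


def expected_abs_row_sum(n):
    """Exact E|R| where R is a sum of n independent uniform +-1 variables."""
    total = sum(math.comb(n, k) * abs(2 * k - n) for k in range(n + 1))
    return total / 2 ** n


def unbalance(a, rng):
    """Random column switches y, then the best row switch x_i for each row.

    Returns (x, y, total) where total = sum_{i,j} a[i][j] * x[i] * y[j],
    which equals the sum of |R_i| over the rows.
    """
    n = len(a)
    y = [rng.choice((-1, 1)) for _ in range(n)]
    x = []
    total = 0
    for row in a:
        r = sum(aij * yj for aij, yj in zip(row, y))
        x.append(1 if r >= 0 else -1)  # flip the row when its sum is negative
        total += abs(r)
    return x, y, total
```

Comparing `expected_abs_row_sum(n) / math.sqrt(n)` with `math.sqrt(2 / math.pi)` for growing `n` gives a quick sanity check of the constant in Theorem 2.25.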


2.9 2-colorings of a hypergraph

Theorem 2.26

Let a $k$-uniform hypergraph have a vertex set $V$ partitioned as

$$V = V_1 \cup V_2 \cup \dots \cup V_k,$$

where $|V_i| = n$ for all $i$. Suppose the edges of the complete $k$-uniform hypergraph on $V$ are colored red and blue such that every edge that intersects all of $V_1, \dots, V_k$ is colored blue. Then there exists a subset of the vertices $S \subseteq V$ such that, among the edges contained in $S$,

$$|\#\text{ blue edges} - \#\text{ red edges}| \ge c_k n^k$$

for some constant $c_k > 0$ depending only on $k$.


For example, if k = 2, we're looking at a 2-coloring of a complete graph where all of the cross-edges between 
two halves are blue: our goal is to get a large difference in the number of red and blue edges. Similarly, if k = 3, we 
partition 3n vertices into three parts and draw triangles. All the triangles that intersect all three parts are blue, but 


everything else can be red or blue. 


Proof. The idea here is to choose $S$ by including each vertex in a given $V_i$ independently with probability $p_i$. We'll leave $p_1, p_2, \dots, p_k$ undetermined for now.

Let's do the proof for $k = 3$ for illustration, but this generalizes to any $k$. Let $a_{xyz}$ be the difference in the number of blue and red edges in $V_x \times V_y \times V_z$. When we randomly pick our vertices, by linearity of expectation, the expected number of blue minus red edges is

$$n^3 p_1 p_2 p_3 + \sum_{\substack{x \le y \le z \\ \text{not all different}}} a_{xyz} p_x p_y p_z.$$

The first term here comes from the forced blue triangles between all $V_i$s. Our goal is to show that the absolute value of this expression is (at least) cubic in $n$ for a suitable choice of the $p_i$, and then some set $S$ does at least as well as the expectation.

We haven't chosen our $p_i$s yet, and for each specific choice, we might end up with expected values that are pretty close to 0. So there is always a coloring that beats a specific set of $p_i$, but we just want to find $p_1, p_2, p_3$ that work given a coloring. This is now just an analysis problem:


Lemma 2.27 
Let $P_k$ denote the set of polynomials of the form $g(p_1, \dots, p_k)$ with degree $k$ and coefficients having absolute value at most 1, where the coefficient of $p_1 p_2 \cdots p_k$ is exactly 1. Then there exists a constant $c_k > 0$ such that for all polynomials $g \in P_k$, there exist $p_1, \dots, p_k \in [0, 1]$ such that

$$|g(p_1, p_2, \dots, p_k)| \ge c_k.$$

The proof of this is short: let $M(g)$ be the supremum

$$M(g) = \sup_{(p_1, \dots, p_k) \in [0,1]^k} |g(p_1, \dots, p_k)|.$$

By continuity and compactness, this is actually an achieved maximum, and it is always positive, since every polynomial in $P_k$ is nonzero. Furthermore, the map $M : P_k \to \mathbb{R}$ is continuous on a compact domain (identify each polynomial with its vector of coefficients), so it must achieve its minimum, which is positive.


This doesn't give a concrete value of $c_k$, but it tells us that one exists! And now we're done with the linearity of expectation argument: every $|a_{xyz}| \le n^3$, so dividing the expected value by $n^3$ gives a polynomial in $P_3$.

The main take-away here is that we decide probabilities for our random process in the last step, since no fixed choice of probabilities will work for every configuration.


2.10 High-dimensional sphere packings 


Problem 2.28 


What is the densest possible packing of unit balls in $\mathbb{R}^n$?

This has been solved for $n = 1$ (trivial), $n = 2$ (a rigorous proof wasn't found until the middle of the 20th century), and $n = 3$ (Kepler's conjecture; proved with computer assistance in the 1990s, and a formal computer proof was recently completed).

Recently, there was a breakthrough that found the answer for $n = 8$ and $n = 24$ as well; those answers come from the $E_8$ and Leech lattices respectively. However, the problem is open in all other dimensions.


The definition of “density” can be thought of pretty intuitively: 


Definition 2.29 


Let $\Delta_n$ be the maximum fraction of space occupied by non-overlapping unit balls in a large box in $\mathbb{R}^n$ as the volume of the box goes to infinity.

We wish to understand bounds on $\Delta_n$. What are examples of good sphere-packings with high density?


Example 2.30 


Consider a packing where we pack greedily: we keep throwing balls in wherever there is space. Alternatively, take 


any maximal packing: basically, find one where we can't fit any additional balls in $\mathbb{R}^n$ anymore without overlap.

What can we say about the density of such a maximal sphere packing? Well, double the radii of every ball, and suppose there is a spot not covered. Then we could just put a unit ball centered at that spot which doesn't intersect any of our initial balls, contradicting maximality of our packing. Thus, we must be able to cover all of $\mathbb{R}^n$ with the balls of doubled radii. Doubling the radius multiplies the volume by $2^n$, and thus

$$2^n \Delta_n \ge 1, \quad \text{so} \quad \Delta_n \ge 2^{-n}.$$

For comparison, what's the packing density for $\mathbb{Z}^n$? We can put a ball with radius $\frac{1}{2}$ at every lattice point, and the density is just the volume of a ball of radius $\frac{1}{2}$. This is a pretty standard formula: for large $n$ it's

$$\frac{2^{-n} \pi^{n/2}}{(n/2)!} < n^{-cn}$$

for some constant $c > 0$, so the integer lattice does very poorly compared to the greedy bound $2^{-n}$. Are there better ways to construct lattices in higher dimensions? Here's the best upper bound we know at the moment:


Theorem 2.31 (Kabatiansky—Levenshtein, 1978) 


The sphere-packing density in $\mathbb{R}^n$ is at most $2^{-(0.599+o(1))n}$.

Where does the probabilistic method come into our picture? Although we won't prove the above fact, we want to at least get a better lower bound than $2^{-n}$.

Definition 2.32

A lattice is the $\mathbb{Z}$-span of a basis in $\mathbb{R}^n$: given $v_1, v_2, \dots, v_n$, we can write a matrix with the basis vectors as columns. A lattice is unimodular if the covolume (volume of the fundamental domain) is 1, which means the matrix has determinant $\pm 1$.

Let's consider matrices $A$ such that $\det A = 1$, so $A \in SL_n(\mathbb{R})$. On the other hand, given a lattice, there are different ways to represent it with a basis: we could always pick $(v_1 + v_2, v_2, \dots, v_n)$ instead of $(v_1, v_2, \dots, v_n)$. Any such change of basis is multiplication by a matrix $B \in SL_n(\mathbb{Z})$.

So the whole point is that unimodular lattices correspond to points of $SL_n(\mathbb{R})/SL_n(\mathbb{Z})$. Our question: is there a way to pick a random lattice here?


Fact 2.33 
The space $SL_n(\mathbb{R})/SL_n(\mathbb{Z})$ has finite Haar measure, so there exists a (normalized) Haar probability measure on it, which allows us to choose a random point in the space. That random point will be our random lattice.

Theorem 2.34 (Siegel mean value theorem)
If $L$ is a random unimodular lattice in $\mathbb{R}^n$ (chosen as above according to the Haar probability measure), and if $S$ is any measurable subset of $\mathbb{R}^n$, then

$$\mathbb{E}\left[|L \cap (S \setminus \{0\})|\right] = \operatorname{vol}(S).$$

The idea is that the average point density is 1, so the expected number of nonzero lattice points in $S$ is its volume. We exclude 0 because it's always in the lattice.

Proof sketch. Observe that the function $S \mapsto \mathbb{E}\left[|L \cap (S \setminus \{0\})|\right]$ is additive, so it is a measure. Because of how we chose our lattice, its distribution is $SL_n(\mathbb{R})$-invariant, so the measure is also $SL_n(\mathbb{R})$-invariant. The only such measures are constant multiples of the Lebesgue measure.

Now imagine we take a very large ball $S$, much larger than the fundamental domain of our lattice: then the number of lattice points inside is the volume up to some boundary errors. So $|S \cap L| \sim \operatorname{vol} S$ and the normalizing constant must be 1.


How do we use this to find dense lattices? 


Proposition 2.35 


There exist lattices with sphere packing density greater than $2^{-n}$.

Proof. Let $S$ be a ball of volume 1 centered at the origin, and pick a random lattice $L$. By the Siegel mean value theorem, the expected number of nonzero lattice points of $L$ that are in $S$ is 1 (think of this as $1 - \varepsilon$). Since this count is a nonnegative integer with expectation less than 1, there must exist $L$ with no nonzero lattice points in $S$.

So now put a copy of $\frac{1}{2}S$ around every point of $L$; these balls do not overlap, and this gives us a packing with density $2^{-n}$. But notice that the nonzero lattice points come in pairs $\{x, -x\}$! In other words, we can take $S$ to be a ball of volume 2 (again, think of $2 - \varepsilon$). Then the expected number of nonzero lattice points is less than 2, and the count is always even, so some lattice has no nonzero points in $S$, which is the same conclusion as before. This yields a sphere packing with density $2^{1-n}$, and this improvement is due to Minkowski.

Can we do better? There are many connections to the geometry of numbers here. There was a long sequence of improvements made, all of the form $\Delta_n \ge c n 2^{-n}$, over a few decades. The constant $c$ went from $\frac{1}{2}$ to about 2, but then Venkatesh realized that we can gain factors of $k$ if we have additional symmetry in our lattices: number theory gives such lattices with $k$-fold symmetry!

For example, consider the lattice corresponding to a cyclotomic field: that is, look at the lattice spanned by a $k$th root of unity $\omega$. This has a $k$-fold action, which is multiplication by $\omega$. The end result is that a “random lattice” can be extended to a random unimodular lattice in dimension $n = 2\phi(k)$, with $k$-fold symmetry, also satisfying the Siegel mean value theorem conditions. So now $k$-fold symmetry gives density

$$\Delta_n \ge k 2^{-n},$$

and this turns out to maximize the gain when $k = p_1 p_2 \cdots p_m$, where $p_i$ is the $i$th prime. Number theoretic calculations give the following result:


Theorem 2.36 (Venkatesh, 2012) 


There exists a lattice packing of unit balls of density 


$$\Delta_n \ge c n \log\log n \cdot 2^{-n}$$

for infinitely many values of $n$ and some $c > 0$.

These values of $n$ are very sparse, but this is the state-of-the-art bound. Venkatesh also used a different method to show that (for all sufficiently large $n$)

$$\Delta_n \ge 60000 n \cdot 2^{-n}.$$

It's an open problem whether or not we can get sphere packings of exponentially better density than this, though!

3 Alterations

Recall the naive probabilistic method: we found some lower bounds for Ramsey numbers in Section 1.1, primarily for the diagonal numbers. We did this with a basic method: color randomly, so that we color each edge red with probability $p$ and blue with probability $1 - p$. Then the probability that we see a red $s$-clique or a blue $t$-clique is (by a union bound) at most

$$\binom{n}{s} p^{\binom{s}{2}} + \binom{n}{t} (1-p)^{\binom{t}{2}},$$

and if this is less than 1 for some $p$, then there exists some coloring of $K_n$ with no red $K_s$ and no blue $K_t$. So we union bounded the bad events there.
Well, the alteration method does a little more than that. Here's a proof that mirrors that of Proposition 1.6. We again color randomly, but the idea now is to delete a vertex from every bad clique (red $K_s$ or blue $K_t$). How many vertices have we deleted? We can estimate by using linearity of expectation:


Theorem 3.1 
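For all $p \in (0, 1)$ and $n \in \mathbb{N}$,

$$R(s, t) > n - \binom{n}{s} p^{\binom{s}{2}} - \binom{n}{t} (1-p)^{\binom{t}{2}}.$$

The right hand side starts with the initial number of vertices and then subtracts the expected number of bad cliques, since we delete at most one vertex for each one. We're going to explore this idea of “fixing the blemishes” a little more.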
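To get a feel for the bound in Theorem 3.1, here is a small Python sketch that evaluates its right hand side and scans over $n$ for the diagonal case with $p = \frac{1}{2}$. The function names `alteration_bound` and `best_diagonal_bound` are hypothetical, and fixing $p = \frac{1}{2}$ is an assumption made for simplicity rather than the optimal choice in general.

```python
from math import comb


def alteration_bound(n, s, t, p):
    """Right hand side of Theorem 3.1: vertices left after deleting one
    vertex per expected red K_s or blue K_t."""
    red = comb(n, s) * p ** comb(s, 2)
    blue = comb(n, t) * (1 - p) ** comb(t, 2)
    return n - red - blue


def best_diagonal_bound(k, n_max):
    """Best lower bound on R(k, k) from Theorem 3.1 with p = 1/2 and n <= n_max."""
    return max(alteration_bound(n, k, k, 0.5) for n in range(1, n_max + 1))
```

Since $R(s, t)$ is strictly larger than the returned value for every $n$ and $p$, any single evaluation already gives a valid lower bound; scanning over $n$ just picks the best one.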


3.1 Dominating sets 


Definition 3.2 


Given a graph G, a dominating set U is a set of vertices such that every vertex not in U has a neighbor in U. 


Basically, we want a subset of vertices such that every vertex is either picked or adjacent to something we picked. 
Clearly the whole set of vertices is dominating, but our goal is to find small dominating sets relative to the number of 


vertices. 


Theorem 3.3 


If our graph $G$ has $n$ vertices and minimum degree $\delta$ among all vertices ($\delta > 1$), then $G$ has a dominating set of size at most

$$n \cdot \frac{\log(\delta+1)+1}{\delta+1}.$$

Proof. We will do a two-step process. First, pick a random subset $X$ by including every vertex independently with probability $p$. Then, add all vertices that are neither in $X$ nor neighbors of a vertex in $X$ (since those are the ones we haven't covered with our set yet); call this set $Y$. By this point, we have a dominating set $X \cup Y$ by construction.

Now, how many vertices do we have in our dominating set? A vertex $v$ is in $Y$ exactly when neither $v$ nor any of its neighbors is in $X$. So $v$ has probability $(1-p)^{\deg(v)+1} \le (1-p)^{1+\delta}$ of being included in $Y$, meaning that the expected size of $X \cup Y$ is

$$\mathbb{E}|X| + \mathbb{E}|Y| \le np + n(1-p)^{1+\delta}.$$

Now we just optimize for $p$. The important computational trick is that $1 - p \le e^{-p}$, which is a good bound when $p$ is small:

$$np + n(1-p)^{1+\delta} \le np + n e^{-p(1+\delta)}.$$

It turns out the optimal value is $p = \frac{\log(\delta+1)}{\delta+1}$, and this gives the result we want: the right hand side becomes $n \cdot \frac{\log(\delta+1)}{\delta+1} + \frac{n}{\delta+1}$.
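The two-step process in this proof translates directly into code. The sketch below is a minimal illustration, assuming the graph is given as an adjacency list `adj` (a list of neighbor lists); the names `random_dominating_set`, `is_dominating`, and `size_bound` are hypothetical.

```python
import math
import random


def random_dominating_set(adj, rng):
    """Alteration method: sample X with probability p, then add the set Y of
    vertices that are neither in X nor adjacent to a vertex of X."""
    n = len(adj)
    delta = min(len(neighbors) for neighbors in adj)
    p = math.log(delta + 1) / (delta + 1)
    x = {v for v in range(n) if rng.random() < p}
    y = {v for v in range(n)
         if v not in x and not any(u in x for u in adj[v])}
    return x | y


def is_dominating(adj, chosen):
    """Check that every vertex is chosen or has a chosen neighbor."""
    return all(v in chosen or any(u in chosen for u in adj[v])
               for v in range(len(adj)))


def size_bound(n, delta):
    """The bound n * (log(delta + 1) + 1) / (delta + 1) from Theorem 3.3."""
    return n * (math.log(delta + 1) + 1) / (delta + 1)
```

The output is always a dominating set; only its size is random, and the theorem says the expected size is at most `size_bound(n, delta)`, so repeating the sampling a few times and keeping the smallest set is a reasonable practical heuristic.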


3.2 A problem from discrete geometry 


Problem 3.4 (Heilbronn triangle problem) 


Place n points in the unit square. How large can we make the smallest area of any triangle formed by our points? 


This is related to the ideas of discrepancy theory. There are applications when we want to evenly distribute 


points, and this is one way of quantifying that randomness. 


Definition 3.5 
Let $\Delta(n)$ be the minimum real number such that for any $n$ points in the unit square, there are three points with triangle area at most $\Delta(n)$.

For example, it's bad to have a square grid of points, since collinear points give a minimal area of 0. If we put the $n$ points on a circle, we get an area on the order of $\frac{1}{n^3}$, which is at least nonzero. The whole point is that we don't want collinearity, so it's hard to think about an efficient picture that is “irregular.”

Heilbronn conjectured that $\Delta(n) \lesssim n^{-2}$, but this was disproved in 1982 by Komlós, Pintz, and Szemerédi (KPS): they showed $\Delta(n) \gtrsim n^{-2} \log n$. On the other hand, the best known upper bound is $\Delta(n) \lesssim n^{-8/7 + o(1)}$.

Below, we use a randomized construction to show that $\Delta(n) \gtrsim n^{-2}$:

Proposition 3.6

There exist $n$ points in the unit square such that every three form a triangle with area at least $cn^{-2}$ for some constant $c > 0$.


Proof. Choose $2n$ points at random (uniformly and independently in the unit square). How can we bound the probability that the area of a triangle $pqr$ is at most $\varepsilon$?

Pick $p$ first. The probability that the distance between $p$ and $q$ is in the range $[x, x + \Delta x]$ is the area of the intersection of the square and the annulus with radii $x$ and $x + \Delta x$ around $p$, which is always $O(x \Delta x)$ (taking $\Delta x$ to be small).

So now, if we fix $p$ and $q$ at distance $x$, what's the probability that our area is at most $\varepsilon$; that is, the height from $r$ to line $pq$ is small? This means we want the distance between line $pq$ and point $r$ to be at most $\frac{2\varepsilon}{x}$, which has probability bounded by a constant times $\frac{\varepsilon}{x}$ (because the allowed region is contained in a rectangle with height $\frac{4\varepsilon}{x}$ and length $\sqrt{2}$).

Putting these together, the probability that the area is at most $\varepsilon$ is bounded by a constant times

$$\int_0^{\sqrt{2}} x \cdot \frac{\varepsilon}{x} \, dx \lesssim \varepsilon.$$

So now we apply the idea of the alteration method: let $X$ be the number of triangles with area at most $\varepsilon$, and delete one point from each such triangle. What's the expected number of points that are removed? We remove at most $X$ points, and $\mathbb{E}[X] \lesssim \varepsilon n^3$, so we'll pick $\varepsilon = \frac{c}{n^2}$ for some constant $c$ such that the expected value of $X$ is less than $n$. Now with positive probability, our process deleted fewer than $n$ points, so we have at least $n$ points with no triangles of area less than $\frac{c}{n^2}$, and we're done.




Actually, we can also do a direct algebraic construction. Let's say we want to find $n$ points in a square grid with no three points collinear. Note that a nondegenerate lattice triangle has area at least $\frac{1}{2}$, so take $n = p$ to be a prime number, and let our points be $\{(x, x^2) : x \in \mathbb{F}_p\}$ in $\mathbb{F}_p^2$. Parabolas have no three points collinear, so after scaling the $p \times p$ grid down to the unit square, we've explicitly constructed configurations with smallest area proportional to $n^{-2}$.

So the idea is that although algebraic solutions are pretty, it's often hard to modify algebraic constructions, while combinatorial proofs let us use heavier hammers.
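The algebraic construction is easy to check by brute force for small primes. The following Python sketch (with hypothetical helper names `parabola_points`, `doubled_area`, and `min_doubled_area`) places the points $(x, x^2 \bmod p)$ on the integer grid and computes twice the smallest triangle area, which should be a positive integer.

```python
from itertools import combinations


def parabola_points(p):
    """Points (x, x^2 mod p) for x in F_p, viewed in the integer grid [0, p)^2."""
    return [(x, (x * x) % p) for x in range(p)]


def doubled_area(a, b, c):
    """Twice the area of triangle abc, via the cross product (an integer)."""
    return abs((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1]))


def min_doubled_area(points):
    """Twice the smallest triangle area over all triples of points."""
    return min(doubled_area(a, b, c) for a, b, c in combinations(points, 3))
```

A value of at least 1 means every triangle has area at least $\frac{1}{2}$ on the grid, hence at least $\frac{1}{2p^2}$ after rescaling to the unit square. The brute force loop is cubic in `p`, so this is only meant for small primes.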


3.3. Hard-to-color graphs 


There are many problems in combinatorics for which probabilistic constructions are the only ones we know. Here's an example that Erdős studied.

Definition 3.7

The chromatic number $\chi(G)$ of a graph is the minimum number of colors needed to properly color $G$.

If we look at a very large graph and look at it locally, we can set some lower bounds on the chromatic number. For example, a $K_4$ means that $\chi(G) \ge 4$. Our question: is it possible to use local information to find that $\chi(G)$ is upper-bounded? Turns out the answer is no!


Definition 3.8 
The girth of a graph G is the length of the shortest cycle in G. 


Theorem 3.9 (Erdős)

For all positive integers $k$ and $\ell$, there exists a graph of girth more than $\ell$ and chromatic number more than $k$.


The idea is that for graphs with large girth, we only see trees locally, and that won’t tell us anything. So the 


chromatic number is (in some sense) a global statistic! 


Theorem 3.10 (Markov's inequality)

Given a random variable $X$ that only takes on nonnegative values, for all $a > 0$,

$$\Pr(X \ge a) \le \frac{\mathbb{E}[X]}{a}.$$

Proof.
$$\mathbb{E}[X] \ge \mathbb{E}[X \cdot 1_{X \ge a}] \ge \mathbb{E}[a \cdot 1_{X \ge a}] = a \Pr(X \ge a).$$


This is used with the mindset that if the expected value of X is small, then X is small with high probability. 


Proof of Theorem 3.9. Construct an Erdős-Rényi random graph $G(n, p)$ with $n$ vertices and each edge appearing independently with probability $p$. Here, let's let

$$p = n^{\theta - 1}, \quad 0 < \theta < \frac{1}{\ell}.$$

Let $X$ be the number of cycles of length at most $\ell$. By linearity of expectation, the expected number of such cycles is

$$\mathbb{E}[X] = \sum_{i=3}^{\ell} \binom{n}{i} \frac{(i-1)!}{2} p^i,$$

since given any $i$ vertices, there are $\frac{(i-1)!}{2}$ different cycles through them. This can be upper bounded by

$$\sum_{i=3}^{\ell} n^i p^i \le \ell n^{\ell} p^{\ell}.$$

Plugging in our choice of $p$, this evaluates to

$$\ell n^{\theta \ell} = o(n)$$

by our choice of $\theta$. Now, what's the probability we have lots of short cycles? By Markov's inequality,

$$\Pr\left(X \ge \frac{n}{2}\right) \le \frac{\mathbb{E}[X]}{n/2} = o(1),$$

so with high probability there are fewer than $\frac{n}{2}$ short cycles, and the alteration method will let us destroy all of them.

Meanwhile, what about the chromatic number? The easiest way to lower bound the chromatic number is to upper bound the independence number $\alpha(G)$, which is the size of the largest independent set. Note that every color class is an independent set (since no two vertices with the same color share an edge), so

$$|V(G)| \le \chi(G) \alpha(G),$$

which is good for us as it gives a lower bound on the chromatic number. Well, the probability that we have an independent set of size at least $x$ is

$$\Pr(\alpha(G) \ge x) \le \binom{n}{x} (1-p)^{\binom{x}{2}},$$

and if this quantity is small, we're good to lower bound the chromatic number. With more bounding,

$$\Pr(\alpha(G) \ge x) < n^x e^{-p x (x-1)/2} = \left(n e^{-p(x-1)/2}\right)^x,$$

and by setting $x = \frac{3}{p} \log n$, this quantity becomes $o(1)$ as well.

We're almost done. Let $n$ be large enough so that we have few short cycles and no large independent set with high probability: $X < \frac{n}{2}$ and $\alpha(G) < x$, each with probability greater than $\frac{1}{2}$. There now exists $G$ with fewer than $\frac{n}{2}$ cycles of length at most $\ell$ and $\alpha(G) < \frac{3}{p} \log n$, and now remove a vertex from each short cycle (of length at most $\ell$) to get a graph $G'$. The number of vertices of $G'$ is now at least $\frac{n}{2}$, since we only removed at most one vertex per short cycle, and

$$\alpha(G') \le \alpha(G) < \frac{3}{p} \log n,$$

so

$$\chi(G') \ge \frac{|V(G')|}{\alpha(G')} \ge \frac{np}{6 \log n} = \frac{n^{\theta}}{6 \log n} > k$$

for sufficiently large $n$, and therefore $G'$ is the graph we're looking for.
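The key quantitative step in this proof is that the expected number of short cycles is $o(n)$. The following Python sketch evaluates that expectation exactly for concrete parameters; the function name `expected_short_cycles` and the sample parameter values are assumptions made for illustration only.

```python
from math import comb, factorial


def expected_short_cycles(n, ell, theta):
    """Expected number of cycles of length at most ell in G(n, p)
    with p = n^(theta - 1), summing C(n, i) * (i - 1)! / 2 * p^i."""
    p = n ** (theta - 1)
    return sum(comb(n, i) * (factorial(i - 1) // 2) * p ** i
               for i in range(3, ell + 1))
```

For example, with `ell = 4` and `theta = 0.2` (so that `theta < 1 / ell`), the ratio `expected_short_cycles(n, 4, 0.2) / n` shrinks as `n` grows, which is exactly what Markov's inequality needs.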


3.4 Coloring edges 


Recall that we defined $m(k)$ in Section 1.5 to be the minimum number of edges in a $k$-uniform hypergraph that is not 2-colorable. (Basically, we want to color the vertices red and blue so that no edge is monochromatic.) We found an upper and lower bound earlier: a randomized construction gives $m(k) \lesssim k^2 2^k$ using $k^2$ vertices, and $m(k) \ge 2^{k-1}$, just by randomly coloring the vertices, since each edge is monochromatic with probability $2^{1-k}$. Let's improve this lower bound now:

Theorem 3.11

$$m(k) \gtrsim 2^k \sqrt{\frac{k}{\log k}}.$$

Proof. Let's say a hypergraph $H$ has $m$ edges. Consider a random greedy coloring: choose a random mapping of the vertices to $[0, 1]$, and go from left to right, always coloring a vertex blue unless that would create an all-blue edge (in which case we color it red).

What's the probability this gives a proper coloring? The only possible failures are red edges: call two edges $e$ and $f$ conflicting if they share exactly one vertex, and that vertex is the final vertex of $e$ and the first vertex of $f$. The idea here is that any failure must give a pair of conflicting edges (if $f$ ends up all red, its first vertex $v$ was colored red only because blue would have completed some edge $e$ ending at $v$).

So what's the probability that such a pair exists? Let's bound it: given two edges $e$ and $f$ that share exactly one vertex, the probability that they conflict is

$$P(e, f) = \frac{(k-1)!^2}{(2k-1)!} = \frac{1}{(2k-1)\binom{2k-2}{k-1}}.$$

Asymptotically, $\binom{2k-2}{k-1}$ is $\frac{2^{2k}}{\sqrt{k}}$ up to a constant factor, so the probability that these two edges conflict is $\Theta\left(\frac{1}{\sqrt{k} \, 2^{2k}}\right)$. Now if $P(e, f)$ is less than $\frac{1}{m^2}$ we're happy, because there are fewer than $m^2$ pairs of edges and we can union bound the bad events. Doing some algebra, this gives

$$m(k) \gtrsim k^{1/4} 2^k.$$


Now let's be more clever. Split the interval $[0, 1]$ into $L = \left[0, \frac{1-p}{2}\right)$, $M = \left[\frac{1-p}{2}, \frac{1+p}{2}\right]$, $R = \left(\frac{1+p}{2}, 1\right]$. A pair of edges that conflict must have $e \subset L$, $e \subset R$, $f \subset L$, or $f \subset R$, or else their common vertex lies in the middle interval $M$.

The probability that $e$ lies in $L$ is just $\left(\frac{1-p}{2}\right)^k$ (each of the $k$ vertices must be in $L$), and we can say similar things about the cases $e \subset R$, $f \subset L$, $f \subset R$. To deal with the middle case, if the common vertex of $e$ and $f$ is $v$, the probability that $e$ and $f$ conflict with $v \in M$ is the probability that $v$ lands at some $x \in M$, the other $k - 1$ vertices of $e$ are to the left of $v$, and the other $k - 1$ vertices of $f$ are to the right of $v$. This is bounded by

$$\int_{(1-p)/2}^{(1+p)/2} x^{k-1} (1-x)^{k-1} \, dx \le p \left(\frac{1}{4}\right)^{k-1}.$$

Putting all of this together, the probability that any pair of conflicting edges exists is bounded by

$$2m \left(\frac{1-p}{2}\right)^k + m^2 p \left(\frac{1}{4}\right)^{k-1},$$

and this is less than 1 if $m = c 2^k \sqrt{\frac{k}{\log k}}$ for a small enough constant $c$ and $p = \frac{\log(4m/2^k)}{k}$, and we've found a bound on $m$ as desired.

4 The Second Moment Method

Starting in this section, we shift the focus to that of concentration: essentially, can we say that the value of our random variable $X$ is relatively close to the mean?


4.1 Refresher on statistics and concentration 


We've been discussing expectations of the form E[X] so far, and let’s say that we find E[X] to be large. Can we 


generally conclude that X is large or positive with high probability? No, because outliers can increase the mean 
dramatically. 


So let’s consider a sum of variables 
$$X = X_1 + X_2 + \dots + X_n, \quad X_i \sim \text{Bernoulli}(p).$$


If the X;s are independent, we know a lot by the central limit theorem: a lot of random variables will converge to a 
Gaussian or other known distribution in the large limit. But most of the time, we only have that our variables are 
“mostly independent” or not independent at all. Is there any way for us to still understand the concentration of the 


sum? 


Definition 4.1 


The variance of a random variable X is defined to be 


$$\operatorname{var}(X) = \mathbb{E}\left[(X - \mathbb{E}[X])^2\right] = \mathbb{E}[X^2] - \mathbb{E}[X]^2.$$

We will often let $\mu$ denote the mean of a variable, $\sigma^2$ denote the variance, and define $\sigma$ to be the (positive) standard deviation of $X$.


Proposition 4.2 (Chebyshev's inequality) 


Given a random variable $X$ with mean $\mu$ and variance $\sigma^2$, then for all $\lambda > 0$,

$$\Pr(|X - \mu| \ge \lambda \sigma) \le \frac{1}{\lambda^2}.$$

Proof. The left hand side is equal to

$$\Pr\left(|X - \mu|^2 \ge \lambda^2 \sigma^2\right),$$

which, by Markov's inequality, is at most

$$\frac{\mathbb{E}\left[|X - \mu|^2\right]}{\lambda^2 \sigma^2} = \frac{\sigma^2}{\lambda^2 \sigma^2} = \frac{1}{\lambda^2}.$$

Why do we care about these results? The central idea is that if our standard deviation $\sigma \ll \mu$, then we have “concentration” with polynomial decay by Chebyshev.

Corollary 4.3 (of Chebyshev)

The probability that $X$ deviates from its mean by more than $\varepsilon$ times its mean is bounded as

$$\Pr(|X - \mathbb{E}[X]| \ge \varepsilon \mathbb{E}[X]) \le \frac{\operatorname{var}(X)}{\varepsilon^2 \mathbb{E}[X]^2}.$$

In particular, if $\operatorname{var}(X) = o(\mathbb{E}[X]^2)$, then $X \sim \mathbb{E}[X]$ with high probability.


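Usually, variance is easy to calculate. This is because

$$\operatorname{var}(X) = \operatorname{cov}[X, X],$$

where $\operatorname{cov}[X, Y]$ is the covariance

$$\operatorname{cov}[X, Y] = \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])] = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y].$$

Since this expression is bilinear, if $X = X_1 + \dots + X_n$, we can expand this out as

$$\operatorname{var}(X) = \sum_{i,j} \operatorname{cov}[X_i, X_j] = \sum_i \operatorname{var}(X_i) + 2 \sum_{i<j} \operatorname{cov}[X_i, X_j].$$

Often the second term here is small, because each $X_i$ is independent of many of the other $X_j$s or there is low covariance between them.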


Example 4.4 (Binomial distribution) 


If $X = X_1 + X_2 + \dots + X_n$, where each $X_i$ is independently distributed via the Bernoulli distribution $\text{Bernoulli}(p)$, the mean is $\mu = \mathbb{E}[X] = np$, and $\sigma^2 = np(1-p)$. As long as $np \gg 1$, we have $\sigma \ll \mu$, so $X \sim \mu$ with high probability.

Later on, we'll get much better bounds. (By the way, we want $np \gg 1$ so our distribution doesn't approach a Poisson distribution instead.)


Example 4.5 


Let $X$ be the number of triangles in a random graph $G(n, p)$, where each edge of $K_n$ is included independently with probability $p$.

Is this variable concentrated around its mean? It's pretty easy to compute that mean: $X$ is the sum over all triangles

$$X = \sum_{\{i,j,k\} \subseteq [n] \text{ distinct}} X_{ijk},$$

where $X_{ijk}$ is 1 if $i, j, k$ form a triangle and 0 otherwise. Each $X_{ijk}$ can be expanded out in terms of the indicator variables for edges:

$$X = \sum_{\{i,j,k\} \subseteq [n] \text{ distinct}} X_{ij} X_{jk} X_{ik}.$$

By linearity of expectation, each term has expectation $p^3$, so $\mathbb{E}[X] = \binom{n}{3} p^3$. The variance is a bit harder, and we're mostly worried about the covariance term: when do those cross-terms come up?

Well, given a pair of triples $T_1, T_2$ of vertices, we can find the covariance for those triangles. If there is at most one vertex of overlap, no edges overlap, so there is no covariance. The others are a bit harder, but we use $\operatorname{cov}[X, Y] = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]$:

$$\operatorname{cov}[X_{T_1}, X_{T_2}] = \begin{cases} 0 & |T_1 \cap T_2| \le 1 \\ p^5 - p^6 & |T_1 \cap T_2| = 2 \\ p^3 - p^6 & T_1 = T_2. \end{cases}$$

So we can now finish the computation:

$$\operatorname{var}(X) = \binom{n}{3}(p^3 - p^6) + \binom{n}{2}(n-2)(n-3)(p^5 - p^6) \lesssim n^3 p^3 + n^4 p^5,$$

and we have $\sigma \ll \mu$ if and only if $p \gg \frac{1}{n}$. So this means that the number of triangles is concentrated around its mean with high probability if $p$ is large enough! Later in the course, we will use other methods to prove better concentration.
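The variance formula for the triangle count can be sanity-checked by brute force on very small graphs. The Python sketch below enumerates every graph on $n$ vertices, weights it by its probability under $G(n, p)$, and compares the resulting variance with the closed form; the function names are hypothetical, and the enumeration is only feasible for roughly $n \le 6$.

```python
from itertools import combinations, product
from math import comb


def triangle_variance_formula(n, p):
    """Closed form: C(n,3)(p^3 - p^6) + C(n,2)(n-2)(n-3)(p^5 - p^6)."""
    return (comb(n, 3) * (p ** 3 - p ** 6)
            + comb(n, 2) * (n - 2) * (n - 3) * (p ** 5 - p ** 6))


def triangle_variance_bruteforce(n, p):
    """Exact variance of the triangle count by enumerating all 2^C(n,2) graphs."""
    edges = list(combinations(range(n), 2))
    triangles = list(combinations(range(n), 3))
    mean = 0.0
    second_moment = 0.0
    for present in product((0, 1), repeat=len(edges)):
        on = {e for e, bit in zip(edges, present) if bit}
        weight = 1.0
        for bit in present:
            weight *= p if bit else 1 - p
        count = sum(1 for (i, j, k) in triangles
                    if (i, j) in on and (j, k) in on and (i, k) in on)
        mean += weight * count
        second_moment += weight * count * count
    return second_moment - mean ** 2
```

Agreement between the two functions on small cases is a check of the covariance case analysis (pairs of triangles sharing zero or one vertex contribute nothing, pairs sharing an edge contribute $p^5 - p^6$).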


Fact 4.6 


It turns out that $X$ satisfies an asymptotic central limit theorem:

$$\frac{X - \mu}{\sigma} \to N(0, 1).$$

This fact was initially proved by taking moments of the form $\mathbb{E}[X^m]$, and the idea is that if the moments agree with the Gaussian moments, we have a Gaussian distribution. But there's a newer method that can be used called the method of projections.


4.2 Threshold functions for subgraphs 
We're going to try to look for small subgraphs in a large random graph G(n, p). Here’s an example: 


Problem 4.7 
For which $p = p_n$ (a sequence in terms of $n$) does $G(n, p)$ have a $K_4$ subgraph with high probability $1 - o(1)$?

Lemma 4.8
For any random variable $X$ that takes on nonnegative values,

$$\Pr(X = 0) \le \frac{\operatorname{var}(X)}{\mathbb{E}[X]^2}.$$

Proof. The probability that $X = 0$ is at most the probability that $|X - \mu| \ge \mu$, which is at most $\frac{\operatorname{var}(X)}{\mu^2}$ by Chebyshev's inequality.

Corollary 4.9
Let $X$ take on only nonnegative values. If the variance of $X$ is much smaller than $\mu^2$, then $X > 0$ with high probability.

Definition 4.10

$r(n)$ is a threshold function for a property $P$ if $p = p_n \ll r(n)$ means that $G(n, p)$ satisfies $P$ with probability $o(1)$, while $p = p_n \gg r(n)$ means that $G(n, p)$ satisfies $P$ with high probability.

Proposition 4.11

The threshold for a random graph to contain $K_3$ (triangles) is $\frac{1}{n}$: the probability that $G(n, p)$ contains a $K_3$ tends to 0 if $pn \to 0$ and to 1 if $pn \to \infty$.

Proof. Let $X$ be the number of triangles in $G(n, p)$. Recall that

$$\mu = \binom{n}{3} p^3 \sim \frac{n^3 p^3}{6}, \quad \sigma^2 = \operatorname{var}(X).$$

If $p \ll \frac{1}{n}$, the mean is $\mu = o(1)$, so by Markov's inequality, the probability that there is at least one triangle vanishes:

$$\Pr(X \ge 1) \le \mathbb{E}[X] = o(1).$$

On the other hand, if $p \gg \frac{1}{n}$, then $\mu \to \infty$, while $\sigma \ll \mu$. So $X$ is concentrated around its mean with high probability, making it positive with high probability.


Problem 4.12 


Given a graph $H$, what's the threshold for containing $H$ as a subgraph?

Let $X = X_1 + \dots + X_m$, where each $X_i$ is an indicator variable for an event $A_i$. We let $i \sim j$ for $i \neq j$ mean that $A_i$ and $A_j$ are not independent. So if $i \neq j$ and $i \not\sim j$, then $\operatorname{cov}[X_i, X_j] = 0$, but if $i \sim j$,

$$\operatorname{cov}[X_i, X_j] = \mathbb{E}[X_i X_j] - \mathbb{E}[X_i]\mathbb{E}[X_j] \le \mathbb{E}[X_i X_j] = \Pr(A_i \cap A_j).$$

So expanding out the expression for variance, and using $\operatorname{var}(X_i) \le \mathbb{E}[X_i]$ for the diagonal terms,

$$\operatorname{var}(X) = \sum_{i,j} \operatorname{cov}[X_i, X_j] \le \mathbb{E}[X] + \Delta,$$

where $\Delta$ is defined as (the bounded covariance term)

$$\Delta = \sum_{(i,j) : i \sim j} \Pr(A_i \cap A_j).$$

So we approximate covariances by probabilities, but if there are very few dependent pairs, we really just care about the number of them. It's possible in general that all the $X_i$s are correlated, or that the $X_i$s are all nearly independent, but that's not the case here.

Corollary 4.13

If $\mathbb{E}[X] \to \infty$ and $\Delta = o(\mathbb{E}[X]^2)$, then with high probability, $X$ is positive and concentrated around its mean.

Simplifying $\Delta$,

$$\Delta = \sum_{(i,j) : i \sim j} \Pr(A_i \cap A_j) = \sum_i \Pr(A_i) \sum_{j : j \sim i} \Pr(A_j \mid A_i),$$

and usually the inner sum doesn't depend on $i$ by symmetry. In such cases, we can define

$$\Delta^* = \sum_{j : j \sim i} \Pr(A_j \mid A_i).$$

We then have

$$\Delta = \sum_i \Pr(A_i) \Delta^* = \Delta^* \cdot \mathbb{E}[X],$$

and this means that if $\mathbb{E}[X] \to \infty$ and $\Delta^* \ll \mu$, then $X$ is positive and concentrated around its mean with high probability.


Proposition 4.14

The threshold for having $K_4$ as a subgraph is $n^{-2/3}$.

Proof. Let $X$ be the number of copies of $K_4$ in $G(n, p)$. The expected value of $X$ is

$$\mathbb{E}[X] = \binom{n}{4} p^6 \sim \frac{n^4 p^6}{24},$$

and if $p \ll n^{-2/3}$, then $\mu = o(1)$, so again by Markov, $X$ is 0 with high probability.

On the other hand, if $p \gg n^{-2/3}$, the mean goes to infinity, and we'll look at the second moment by letting $A_S$ be the event that $S$ induces a $K_4$, for each set $S$ of four vertices. Then

$$\Delta^* \lesssim n^2 p^5 + n p^3,$$

where $n^2 p^5$ comes from sets sharing two vertices (which means we need to find two more vertices and have 5 more edges, each chosen with probability $p$), and $n p^3$ comes from sets sharing three vertices (meaning we find one more vertex and have 3 more edges chosen). Provided that $p \gg n^{-2/3}$, both terms here are small: $\Delta^* = o(\mathbb{E}[X])$, and we are done by Corollary 4.13.

So it seems we should be able to do this with any graph $H$. But the idea with $K_3$ and $K_4$ was that any $p$ with $\mu \to \infty$ gave $X > 0$ with high probability. In general, the answer isn't quite so simple.


Question 4.15. Consider a $K_4$ with an extra edge attached to a vertex as the subgraph that we're looking for. What is its threshold density?

The expected number of copies of this is $\mathbb{E}[X] \asymp n^5 p^7$, so we might predict that the threshold is $p = n^{-5/7}$. Indeed, if $p \ll n^{-5/7}$, $\mathbb{E}[X]$ is very small, and we have zero copies with high probability. But now let's say $p \gg n^{-5/7}$ but $p \ll n^{-2/3}$. With high probability there are no $K_4$s, so there's no way we can have this graph at all. Finally, when $p \gg n^{-2/3}$, we have a bunch of $K_4$s: it can be shown that we can easily find another edge to connect to our $K_4$. Therefore, the threshold density is $n^{-2/3}$, and that threshold is not just dependent on the number of edges and vertices of our subgraph $H$!

In a way, this is saying that $K_4$s are the “hard part” of the graph to hit, and the next definition helps us quantify that.


Definition 4.16 

Define $\rho(H) = \frac{e(H)}{v(H)}$, sometimes called the density of $H$, to be the ratio of edges to vertices in our graph $H$. $H$ is balanced if every subgraph $H'$ has $\rho(H') \le \rho(H)$. In general, define the maximum subgraph density $m(H)$ to be the maximum of $\rho(H')$ across all subgraphs $H'$ (so $m(H) = \rho(H)$ when $H$ is balanced).

Example 4.17

Cliques are balanced: the density of $K_k$ is $\frac{k-1}{2}$ and we can't do better. On the other hand, the $K_4$ plus an edge is not balanced, since $\rho = \frac{7}{5}$ but the $\rho$ of $K_4$ is $\frac{3}{2}$.
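For small graphs, the maximum subgraph density can be computed by brute force over vertex subsets (it suffices to look at induced subgraphs, since adding edges on a fixed vertex set only raises the density). The Python sketch below uses exact fractions; the function names `density` and `max_subgraph_density` are hypothetical, and graphs are assumed to be given as a vertex list plus a list of edge pairs.

```python
from fractions import Fraction
from itertools import combinations


def density(vertices, edges):
    """rho(H) = e(H) / v(H) as an exact fraction."""
    return Fraction(len(edges), len(vertices))


def max_subgraph_density(vertices, edges):
    """m(H): the largest density over all induced subgraphs of H."""
    best = Fraction(0)
    for size in range(1, len(vertices) + 1):
        for subset in combinations(vertices, size):
            chosen = set(subset)
            induced = [e for e in edges if e[0] in chosen and e[1] in chosen]
            best = max(best, Fraction(len(induced), size))
    return best
```

For the $K_4$ with a pendant edge, this gives density $\frac{7}{5}$ but maximum subgraph density $\frac{3}{2}$, matching the example. The search is exponential in the number of vertices, so it is only meant for small $H$.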


In fact, $m(H)$ is actually what designates the threshold density:

Theorem 4.18

If we pick each edge of $K_n$ with probability $p$, the threshold for having $H$ as a subgraph is $p = n^{-1/m(H)}$.


The proof is very similar to what we've been doing. 


Proof. Let $H'$ be the subgraph with maximum density $\rho(H') = m(H)$. If $p \ll n^{-1/m(H)}$, the expected number of copies of $H'$ is
$$\mathbb{E}[X_{H'}] \asymp n^{v_{H'}} p^{e_{H'}} = o(1),$$
so with high probability $G(n, p)$ has no copies of $H'$ and therefore no copies of $H$.

Now if $p \gg n^{-1/m(H)}$, we want to count the number of copies of $H$. For sets $S$ of vertices with $|S| = v_H$,
$$\Delta^* = \sum_{T : |T| = v_H,\ |T \cap S| \ge 2} \Pr(A_T \mid A_S),$$
where $A_T$ is the event that $T$ contains a copy of $H$.

Doing cases based on the size of $T \cap S$ (like we did before), let's say $T$ intersects $S$ in $k$ spots. Here's the key step where we use the maximum subgraph density: overlaps in the covariance terms are subgraphs of $H$. If $H'$ is the overlap between $S$ and $T$, the contribution to $\Delta^*$ is
$$\lesssim n^{v_H - v_{H'}} p^{e_H - e_{H'}} \ll n^{v_H} p^{e_H}$$
for all $H'$, because $n^{v_{H'}} p^{e_{H'}} \to \infty$ when $p \gg n^{-1/m(H)}$ (as $\rho(H') \le m(H)$). So if we keep track of all the overlaps, we find that $\Delta^* = o(\mathbb{E}[X])$, meaning the overlaps don't contribute much.

This finishes the proof by Corollary 4.13.


4.3 Clique number 


Question 4.19. What can we say about $\omega(G)$, the number of vertices in the maximum size clique of $G$, if each edge of $K_n$ is included with probability $\frac{1}{2}$?

We can't quote any of the results from last time, since we're not sticking to fixed-size subgraphs. But this is still not too hard to calculate from first principles.

Let $f(k)$ be the expected number of $k$-cliques: this is just $\binom{n}{k} 2^{-\binom{k}{2}}$ by linearity of expectation. We can make a naive guess: perhaps we have a clique whenever this quantity goes to infinity and not when the quantity goes to 0.


Theorem 4.20
Let $k = k(n)$ be a function such that $f(k) = \binom{n}{k} 2^{-\binom{k}{2}}$ goes to infinity. Then
$$\omega\left(G\left(n, \tfrac{1}{2}\right)\right) \ge k$$
with high probability.


Proof. For all subsets $S$ of the vertices of size $k$, let $A_S$ be the event that $S$ is a clique, and let $\chi_S$ be the indicator variable for $A_S$. Then the number of $k$-cliques
$$X = \sum_S \chi_S$$
has expectation $f(k)$, and we want to show that the variance is much smaller than the mean squared. This is very similar to the earlier proof: fixing $S$, we can find $\Delta^*$ by summing over all $T \ne S$ that intersect $S$ in at least two vertices (those are the only ones that can be dependent on $S$):
$$\Delta^* = \sum_{T : |T \cap S| \ge 2} \Pr(A_T \mid A_S).$$
We can write this down explicitly, since the expression $\Pr(A_T \mid A_S)$ just depends on the size of the intersection:
$$\Delta^* = \sum_{\ell = 2}^{k - 1} \binom{k}{\ell}\binom{n - k}{k - \ell}\, 2^{\binom{\ell}{2} - \binom{k}{2}},$$
where the product of binomial coefficients is the number of ways to choose $T$ with an overlap of $\ell$ vertices, and the power of 2 is the probability that $T$ is a clique given that the $\ell$ shared vertices in $S$ are all connected. This does indeed turn out to be small enough: omitting the detailed calculations,
$$\Delta^* \ll \binom{n}{k} 2^{-\binom{k}{2}} = \mathbb{E}[X],$$
so we're done.


We also know by Markov's inequality that if the expected value goes to 0, the probability of having a $k$-clique is $o(1)$. The idea is that if there's some value $k$ such that $f(k) \gg 1$ and $f(k + 1) \ll 1$, then we have a distinctive threshold. But it might be that one of the $f$ values is of constant order, and then the theorem doesn't tell us what happens for that specific value of $k$.


Theorem 4.21
There exists a $k_0 = k_0(n)$ such that with high probability,
$$\omega\left(G\left(n, \tfrac{1}{2}\right)\right) \in \{k_0, k_0 + 1\},$$
and $k_0 \sim 2\log_2 n$.


This is known as two-point concentration. Rephrasing this, if we create this graph at random, we expect one of 


two values for the clique number. 


Proof sketch. We can check that for $k \sim 2\log_2 n$,
$$\frac{f(k + 1)}{f(k)} = \frac{n - k}{k + 1}\, 2^{-k} = n^{-1 + o(1)} = o(1).$$
(In particular, the gap between two adjacent values of $k$ is too large to allow several values of $k$ to give constant order $f(k)$.) Then let $k_0 = k_0(n)$ be the value such that
$$f(k_0) \ge 1 > f(k_0 + 1);$$
then $f(k_0 - 1) \gg 1$ and $f(k_0 + 2) \ll 1$.
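To get a feel for how sharply $f$ drops, one can compute it exactly for a concrete $n$. The sketch below is illustrative only: the helper names and the choice $n = 1000$ are assumptions made here, and exact rational arithmetic is used to avoid floating-point overflow.

```python
from fractions import Fraction
from math import comb, log2

def expected_cliques(n, k):
    """f(k) = C(n, k) * 2^(-C(k, 2)), the expected number of k-cliques in G(n, 1/2)."""
    return Fraction(comb(n, k), 2 ** comb(k, 2))

def find_k0(n):
    """The k0 with f(k0) >= 1 > f(k0 + 1).

    f(1) = n >= 1, and f rises and then falls, so the first k with
    f(k + 1) < 1 is the one we want.
    """
    k = 1
    while expected_cliques(n, k + 1) >= 1:
        k += 1
    return k

n = 1000                      # illustrative choice
k0 = find_k0(n)
for k in range(k0 - 1, k0 + 3):
    print(k, float(expected_cliques(n, k)))
print("2 log2 n =", 2 * log2(n))
```

For moderate $n$ the value $k_0$ sits noticeably below $2\log_2 n$, since the asymptotic statement $k_0 \sim 2\log_2 n$ hides lower-order terms.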


It turns out that for most but not all values of $n$, there is only one value $k_0$ that $\omega$ takes on with high probability! Later in this class, we'll be able to say something more specific.


4.4 Chromatic number 


Question 4.22. What is the expected chromatic number (minimum number of colors needed for a proper coloring) of a random graph $G(n, \frac{1}{2})$?

Remember that we have the result $\chi(G)\alpha(G) \ge n$, because each color class is an independent set (and therefore one of them has size at least $\frac{n}{\chi(G)}$).


Corollary 4.23

The expected independence number of $G$ is also $\sim 2\log_2 n$, since
$$\alpha(G) = \omega(\overline{G}),$$
and including an edge in $G$ with probability $\frac{1}{2}$ is equivalent to including it in the complement $\overline{G}$ with probability $\frac{1}{2}$.

So this means we can guarantee
$$\chi(G) \ge \frac{n}{\alpha(G)} \sim \frac{n}{2\log_2 n}$$
with high probability.

Do we also have an upper bound? Can we show that we can color $G(n, \frac{1}{2})$ with that many colors?


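Theorem 4.24 (Bollobás, 1987)

The chromatic number satisfies
$$\chi\left(G\left(n, \tfrac{1}{2}\right)\right) \sim \frac{n}{2\log_2 n}$$
with high probability.

We'll see how to prove this later on using martingale concentration.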


4.5 Number theory 


This class was advertised as using probability to solve problems that don't involve probability. The next few examples have no inherent randomness, but we'll still use the second moment method to solve them.

Let $\nu(n)$ denote the number of prime divisors of $n$, not counting multiplicity. Can we figure out the typical size of $\nu(n)$ just given $n$?


Theorem 4.25 (Hardy-Ramanujan 1920)

For all $\varepsilon > 0$, there exists a constant $c$ such that all but an $\varepsilon$ fraction of the numbers $x \in [1, n]$ satisfy
$$|\nu(x) - \log\log n| \le c\sqrt{\log\log n}.$$

Remark. $\log$ refers to the natural logarithm in number theory contexts.


Proof by Turán, 1934. We're going to use a basic intuition about a “random model of the primes.” Statistically, the primes have many properties that make them seem random, even though they are not random themselves.

Pick a random $x \in [n]$. For each prime $p$, let $X_p$ be the indicator variable
$$X_p = \begin{cases} 1 & p \mid x \\ 0 & \text{otherwise.} \end{cases}$$
Then the number of prime divisors of $x$ less than or equal to $M$ is
$$X = \sum_{p \le M} X_p,$$
where we pick $M = n^{1/10}$, a constant power of $n$. Then there are at most 10 prime factors of $x$ larger than $M$, so
$$\nu(x) - 10 \le X \le \nu(x).$$


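Since we're dealing with asymptotics, that constant is okay for our purposes here. We're treating $X$ as a random variable: we want to show that it is concentrated and that its mean is around $\log\log n$. Each $X_p$ is also a random variable, so this is a good use of the second moment method: we have
$$\mathbb{E}[X_p] = \frac{\lfloor n/p \rfloor}{n} = \frac{1}{p} + O\left(\frac{1}{n}\right)$$
for each prime $p$, so the mean of the random variable is
$$\mathbb{E}[X] = \sum_{p \le M} \left(\frac{1}{p} + O\left(\frac{1}{n}\right)\right).$$

We'll now use a basic result from analytic number theory:

Theorem 4.26 (Mertens' theorem)
Adding over all primes up to $N$,
$$\sum_{p \le N} \frac{1}{p} = \log\log N + O(1).$$

In particular, since $\log\log n^{1/10} = \log\log n - \log 10$, this gives $\mathbb{E}[X] = \log\log n + O(1)$.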


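To find the expected value of $X^2$, we need to understand the covariance between different $X_p$s. For any primes $p \ne q$,
$$\operatorname{cov}[X_p, X_q] = \mathbb{E}[X_p X_q] - \mathbb{E}[X_p]\mathbb{E}[X_q] = \frac{\lfloor n/pq \rfloor}{n} - \frac{\lfloor n/p \rfloor \lfloor n/q \rfloor}{n^2} \le \frac{1}{pq} - \left(\frac{1}{p} - \frac{1}{n}\right)\left(\frac{1}{q} - \frac{1}{n}\right) \le \frac{1}{n}\left(\frac{1}{p} + \frac{1}{q}\right).$$
The idea is that these variables are basically independent by the Chinese Remainder Theorem, except for the “edge cases” near $n$. So the total sum of the covariances is
$$\sum_{p \ne q,\ p, q \le M} \operatorname{cov}[X_p, X_q] \le \frac{1}{n} \sum_{p \ne q,\ p, q \le M} \left(\frac{1}{p} + \frac{1}{q}\right) \le \frac{2M}{n} \sum_{p \le M} \frac{1}{p} \lesssim n^{-9/10} \log\log n = o(1),$$
since $M = n^{1/10}$. Now the variance of $X$ is
$$\operatorname{var}(X) = \sum_{p \le M} \operatorname{var}(X_p) + o(1) = \log\log n + O(1)$$
(which is not very large), and therefore the standard deviation is on the order of $\sqrt{\log\log n}$. Now by Chebyshev's inequality,
$$\Pr\left(|X - \log\log n| \ge \lambda\sqrt{\log\log n}\right) \le \frac{1}{\lambda^2} + o(1),$$
and since $X$ is within 10 of $\nu(x)$, we've shown concentration with high probability (just pick $\lambda$ to be whatever constant we need in terms of $\varepsilon$).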
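The concentration statement can be explored numerically by sieving. This is a rough sketch rather than anything from the notes: the function names are made up here, and because $\log\log n$ grows extremely slowly, agreement at computable sizes is only up to the $O(1)$ error terms.

```python
from math import log

def distinct_prime_factor_counts(n):
    """Return nu with nu[x] = number of distinct primes dividing x, for 0 <= x <= n."""
    nu = [0] * (n + 1)
    for p in range(2, n + 1):
        if nu[p] == 0:                      # no smaller prime divides p, so p is prime
            for multiple in range(p, n + 1, p):
                nu[multiple] += 1
    return nu

def summary(n):
    """Empirical mean and variance of nu(x) over x in [1, n], plus log log n."""
    nu = distinct_prime_factor_counts(n)[1:]
    mean = sum(nu) / n
    var = sum((v - mean) ** 2 for v in nu) / n
    return mean, var, log(log(n))

mean, var, loglog = summary(10**4)
print(mean, var, loglog)
```

Both the empirical mean and the empirical variance should stay within a bounded distance of $\log\log n$, in line with the $\log\log n + O(1)$ estimates above.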


What's the distribution, though? Is $\sqrt{\log\log n}$ the right order of magnitude? If we really believe the $X_p$s are independent, we should believe in the central limit theorem.

Theorem 4.27 (Erdős-Kac theorem)

Picking a random $x \in [n]$, $\nu(x)$ is asymptotically normal:
$$\Pr_{x \in [n]}\left(\frac{\nu(x) - \log\log n}{\sqrt{\log\log n}} \ge \lambda\right) \to \frac{1}{\sqrt{2\pi}} \int_\lambda^\infty e^{-t^2/2}\, dt$$
for all $\lambda \in \mathbb{R}$.


We briefly mentioned the method of moments earlier: instead of looking at second moments, look at higher moments as well. There's a theorem in probability that if all the moments of our random variables converge to those of certain distributions (including the normal distribution), then convergence in distribution happens.

We can do this explicitly if we want, but it gets a bit tedious. Here's a trick that simplifies the calculation: let's compare $\mathbb{E}[X^k]$ with that of an “idealized” random variable $Y$.


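Proof. This time, set $M = n^{1/s(n)}$ where $s(n) \to \infty$ slowly. Choosing $s(n) = \log\log\log n$ is fine, but $s(n)$ can't grow too quickly because we have that
$$\nu(x) - s(n) \le X \le \nu(x).$$
(Joke: What's the sound a drowning number theorist makes?...) So now let
$$Y = \sum_{p \le M} Y_p,$$
where $Y_p$ is now idealized to $\text{Bernoulli}(\frac{1}{p})$, independent of the other variables. This is supposed to model $X_p$. So now let
$$\mu = \mathbb{E}[Y] \sim \mathbb{E}[X]$$
and
$$\sigma^2 = \operatorname{var}(Y) \sim \operatorname{var}(X).$$
Set
$$\tilde{X} = \frac{X - \mu}{\sigma}, \qquad \tilde{Y} = \frac{Y - \mu}{\sigma}.$$
By the central limit theorem, we know that $\tilde{Y}$ converges to $N(0, 1)$. Now let's compare $\tilde{X}$ and $\tilde{Y}$, showing that for all $k$,
$$\mathbb{E}[\tilde{X}^k] \approx \mathbb{E}[\tilde{Y}^k],$$
which are (by the central limit theorem) also asymptotically equal to $\mathbb{E}[Z^k]$ for the standard normal distribution $Z$.

When we expand out the terms of $\mathbb{E}[X^k - Y^k]$, for distinct primes $p_1, \cdots, p_r \le M$, they look like
$$\mathbb{E}[X_{p_1} X_{p_2} \cdots X_{p_r} - Y_{p_1} Y_{p_2} \cdots Y_{p_r}] = \frac{1}{n}\left\lfloor \frac{n}{p_1 p_2 \cdots p_r} \right\rfloor - \frac{1}{p_1 p_2 \cdots p_r} = O\left(\frac{1}{n}\right).$$
So if we compare the expansions of $\tilde{X}^k$ and $\tilde{Y}^k$ in terms of the $X_p$s and $Y_p$s, there are $M^k = n^{o(1)}$ terms. Since each term contributes $O\left(\frac{1}{n}\right)$, the moments are essentially the same:
$$\mathbb{E}[\tilde{X}^k - \tilde{Y}^k] = n^{-1 + o(1)} = o(1).$$
Since all moments converge, $\tilde{X}$ converges to the normal distribution asymptotically.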
4.6 Distinct sums

Question 4.28. What's the size of the largest subset $S \subseteq [n]$ such that all $2^{|S|}$ subset sums of $S$ are distinct?

Example 4.29

We can take $S = \{1, 2, 4, \cdots, 2^k\}$, where $k = \lfloor \log_2 n \rfloor$. All sums are distinct by base-2 expansion.

This set has size $\lfloor \log_2 n \rfloor + 1$. Is there any way we can do much better?

Problem 4.30 (Open; Erdős offered \$300 for this one)
Prove or disprove: $|S| \le \log_2 n + O(1)$.

One thing we know is that all subset sums are at most $n|S|$, since there are only $|S|$ things we can add. There are $2^{|S|}$ sums, so if they're all distinct, by Pigeonhole, we must have $2^{|S|} \le n|S| + 1$, which rearranges to
$$|S| \le \log_2 n + \log_2\log_2 n + O(1).$$


Can we formulate a better argument than this? Let's try the second moment method! The idea is that if we pick a 


random subset sum, we should expect some concentration around the mean. 


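Theorem 4.31

Every subset $S \subseteq [n]$ with distinct subset sums has
$$|S| \le \log_2 n + \frac{1}{2}\log_2\log_2 n + O(1).$$

Proof. Given our set $S = \{x_1, \cdots, x_k\}$, define a random variable $X = \varepsilon_1 x_1 + \cdots + \varepsilon_k x_k$, where each $\varepsilon_i \in \{0, 1\}$ is chosen uniformly and independently. The mean is (by linearity of expectation) just $\mu = \frac{1}{2}(x_1 + \cdots + x_k)$, and the variance is
$$\sigma^2 = \frac{1}{4}(x_1^2 + \cdots + x_k^2) \le \frac{n^2 k}{4},$$
since all $x_i \le n$. By Chebyshev's inequality, for all $\lambda > 1$,
$$\Pr\left(|X - \mu| < \frac{\lambda n \sqrt{k}}{2}\right) \ge 1 - \frac{1}{\lambda^2}.$$
But $X$ must take distinct values for all different instantiations, so the probability that $X = x$ is at most $2^{-k}$ for each $x$. In the probability expression above, $X$ must lie in the range $\left[\mu - \frac{\lambda n \sqrt{k}}{2}, \mu + \frac{\lambda n \sqrt{k}}{2}\right]$, which contains at most $\lambda n \sqrt{k} + 1$ integers, so
$$\Pr\left(|X - \mu| < \frac{\lambda n \sqrt{k}}{2}\right) \le 2^{-k}\left(\lambda n \sqrt{k} + 1\right).$$
Putting these inequalities together,
$$1 - \frac{1}{\lambda^2} \le 2^{-k}\left(\lambda n \sqrt{k} + 1\right),$$
which rearranges to
$$n \ge \frac{2^k(1 - \lambda^{-2}) - 1}{\lambda\sqrt{k}}.$$
We can choose $\lambda$ to optimize this expression: in this case, $\lambda = \sqrt{3}$ yields the desired result.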
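For small $n$ the question can simply be brute-forced, which also shows that powers of 2 are not always optimal. The sketch below is illustrative (the function names are assumptions made here): for $n = 7$ the set $\{3, 5, 6, 7\}$ has distinct subset sums and size 4, one more than $\{1, 2, 4\}$.

```python
from itertools import combinations

def has_distinct_subset_sums(s):
    """True if all 2^|s| subset sums of the positive integers in s are distinct."""
    sums = {0}                          # subset sums of the elements processed so far
    for x in s:
        shifted = {t + x for t in sums}
        if sums & shifted:              # a sum using x collides with one not using x
            return False
        sums |= shifted
    return True

def largest_distinct_sum_subset(n):
    """Brute-force a largest subset of {1, ..., n} with distinct subset sums (small n only)."""
    for size in range(n, 0, -1):
        for cand in combinations(range(1, n + 1), size):
            if has_distinct_subset_sums(cand):
                return cand
    return ()

print(has_distinct_subset_sums([1, 2, 4, 8]))
print(has_distinct_subset_sums([3, 5, 6, 7]))
print(largest_distinct_sum_subset(7))
```

The search is exponential in $n$, so it is only meant for experimenting with very small cases.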
4.7 An application to analysis

We're going to prove the following result using the second moment method:

Theorem 4.32 (Weierstrass approximation theorem)

Let $f : [0, 1] \to \mathbb{R}$ be a continuous function on a bounded interval. Given $\varepsilon > 0$, it is possible to approximate $f$ by a polynomial $p(x)$ such that
$$|p(x) - f(x)| \le \varepsilon \quad \forall x \in [0, 1].$$

Proof. First of all, since $[0, 1]$ is compact, $f$ is bounded and uniformly continuous. Rescale $f$ so that it is bounded by 1, so now $|f(x)| \le 1$ for all $x$. By uniform continuity, there exists a $\delta > 0$ such that
$$|f(x) - f(y)| \le \frac{\varepsilon}{2} \quad \text{for all } x, y \text{ with } |x - y| \le \delta.$$
Let $X$ be a random variable $X \sim \text{Binomial}(n, x)$: then
$$\Pr(X = j) = \binom{n}{j} x^j (1 - x)^{n - j} \quad \text{for all } 0 \le j \le n.$$
We know the statistics $\mathbb{E}[X] = nx$, $\operatorname{var}(X) = nx(1 - x) \le n$. So by Chebyshev,
$$\Pr\left(|X - nx| > n^{2/3}\right) \le n^{-1/3}.$$
In particular, if we take $n$ fixed but large enough (say $n \ge \max(64\varepsilon^{-3}, \delta^{-3})$), we can bound this in terms of $\varepsilon$:
$$\Pr\left(|X - nx| > n^{2/3}\right) \le \frac{\varepsilon}{4}.$$
We can now write down our approximating polynomial explicitly:
$$P_n(x) = \sum_{i=0}^{n} \binom{n}{i} x^i (1 - x)^{n - i} f\left(\frac{i}{n}\right).$$
Basically, chop up $[0, 1]$ into $n$ intervals and sample the value at each one. We claim that this works: we do have $|P_n(x) - f(x)| \le \varepsilon$ for all $x \in [0, 1]$. To show this, note that by the triangle inequality,
$$|P_n(x) - f(x)| \le \sum_{i=0}^{n} \binom{n}{i} x^i (1 - x)^{n - i} \left|f\left(\frac{i}{n}\right) - f(x)\right|,$$
implicitly using that $\sum_{i=0}^{n} \binom{n}{i} x^i (1 - x)^{n - i} = 1$. The idea is that this absolute value is small if $x \approx \frac{i}{n}$; otherwise, Chebyshev bounds the contribution! We'll split this up into two terms, those close to and those far away from our given $x$:
$$\le \sum_{i : |\frac{i}{n} - x| \le n^{-1/3}} \binom{n}{i} x^i (1 - x)^{n - i} \left|f\left(\frac{i}{n}\right) - f(x)\right| + 2 \sum_{i : |\frac{i}{n} - x| > n^{-1/3}} \binom{n}{i} x^i (1 - x)^{n - i},$$
where the 2 comes from the fact that $|f(x)| \le 1$. But now note that the absolute value in the first term only involves points $\frac{i}{n}$ within $\delta$ of $x$ (since $n^{-1/3} \le \delta$), and the second term was bounded earlier:
$$\le \frac{\varepsilon}{2} \sum_{i : |\frac{i}{n} - x| \le n^{-1/3}} \binom{n}{i} x^i (1 - x)^{n - i} + 2 \cdot \frac{\varepsilon}{4},$$
and now both terms are at most $\frac{\varepsilon}{2}$, so this is at most $\varepsilon$, as desired.
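The polynomial $P_n$ in this proof is easy to evaluate directly. Here is a small sketch under stated assumptions: the function names, the test function $|x - \frac{1}{2}|$, and the evaluation grid are all choices made for illustration, and the naive formula is only numerically reasonable for moderate $n$ (a few hundred).

```python
from math import comb

def bernstein(f, n, x):
    """P_n(x) = sum_i C(n, i) x^i (1 - x)^(n - i) f(i / n) = E[f(X / n)], X ~ Binomial(n, x)."""
    return sum(
        comb(n, i) * x**i * (1 - x) ** (n - i) * f(i / n)
        for i in range(n + 1)
    )

def sup_error(f, n, grid=201):
    """Largest |P_n(x) - f(x)| over an evenly spaced grid of points in [0, 1]."""
    points = [j / (grid - 1) for j in range(grid)]
    return max(abs(bernstein(f, n, x) - f(x)) for x in points)

def kink(x):
    """A continuous but non-smooth test function, bounded by 1 on [0, 1]."""
    return abs(x - 0.5)

for n in (20, 200):
    print(n, sup_error(kink, n))
```

The error shrinks as $n$ grows, though slowly near the kink at $x = \frac{1}{2}$, which matches the probabilistic picture: $X/n$ concentrates around $x$ only at scale about $n^{-1/2}$.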
5 The Chernoff bound

5.1 Setup and proof

The second moment method essentially compares the values of $\mathbb{E}[X]$ and $\mathbb{E}[X^2]$ to each other. Why do we not take higher moments? In general, if we have a sum of independent random variables
$$X = X_1 + \cdots + X_n,$$
we can look at $p(X)$, which is some polynomial in $X$, and apply Markov's inequality in the same way that we did for Chebyshev's inequality. It turns out that if we're allowed to look at arbitrarily high-degree polynomials, it's usually better to just look at the following object:

Definition 5.1

The moment generating function of a random variable $X$ is a function of $t$:
$$\mathbb{E}[e^{tX}] = 1 + t\,\mathbb{E}[X] + \frac{t^2\,\mathbb{E}[X^2]}{2!} + \cdots.$$

What are its applications?

Theorem 5.2 (Chernoff bound)
Let $S_n = X_1 + \cdots + X_n$, where $X_i = \pm 1$ uniformly and independently. Then for all $\lambda > 0$,
$$\Pr(S_n \ge \lambda\sqrt{n}) \le e^{-\lambda^2/2}.$$

This gives better tail decay! While the second moment method only gave us polynomial decay (a right hand side of the form $\frac{1}{\lambda^2}$), this is exponential decay instead.

Proof. Let $t \ge 0$ be a real number, and consider the moment generating function
$$\mathbb{E}[e^{tS_n}].$$
Since $S_n$ is a sum of independent random variables, this is
$$\mathbb{E}[e^{t(X_1 + \cdots + X_n)}] = \mathbb{E}[e^{tX_1}]\,\mathbb{E}[e^{tX_2}] \cdots \mathbb{E}[e^{tX_n}] = \mathbb{E}[e^{tX_1}]^n = \left(\frac{e^t + e^{-t}}{2}\right)^n.$$
We have $\frac{e^t + e^{-t}}{2} \le e^{t^2/2}$ by comparing coefficients of $t^{2j}$ in the Taylor expansions:
$$\frac{1}{(2j)!} \le \frac{1}{j!\,2^j},$$
so our moment generating function is at most $e^{nt^2/2}$, and by Markov's inequality,
$$\Pr(S_n \ge \lambda\sqrt{n}) \le \frac{\mathbb{E}[e^{tS_n}]}{e^{t\lambda\sqrt{n}}} \le e^{-t\lambda\sqrt{n} + t^2 n/2},$$
and setting $t = \frac{\lambda}{\sqrt{n}}$ gives the desired result.


By symmetry, we have the same bound for $S_n \le -\lambda\sqrt{n}$ as well, so combining these, we obtain the following:

Corollary 5.3

Using the definition of $S_n$ above,
$$\Pr\left(|S_n| \ge \lambda\sqrt{n}\right) \le 2e^{-\lambda^2/2}$$
for all $\lambda > 0$.
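To see how much room the bound leaves, the tail of $S_n$ can be computed exactly from the binomial distribution. This is a sketch with illustrative names and parameters ($n = 100$ is chosen so that $\sqrt{n}$ is an exact float); it is not taken from the notes.

```python
from math import ceil, comb, exp, sqrt

def exact_upper_tail(n, lam):
    """Pr(S_n >= lam * sqrt(n)) for S_n a sum of n independent uniform +1/-1 signs."""
    # S_n = 2B - n with B ~ Binomial(n, 1/2), so S_n >= a exactly when B >= (n + a) / 2.
    b_min = max(0, ceil((n + lam * sqrt(n)) / 2))
    return sum(comb(n, b) for b in range(b_min, n + 1)) / 2**n

n = 100
for lam in (0.5, 1, 2, 3):
    exact = exact_upper_tail(n, lam)
    chernoff = exp(-lam**2 / 2)
    chebyshev = 1 / lam**2        # two-sided second moment bound, for comparison
    print(lam, exact, chernoff, chebyshev)
```

For small $\lambda$ the Chebyshev bound is trivial or weak, while for larger $\lambda$ the exponential decay of the Chernoff bound takes over.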


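But notice that $S_n/\sqrt{n}$ converges to a Gaussian distribution for large $n$, so something similar should be true for Gaussians as well. This is indeed true:

Fact 5.4
For the standard normal distribution $Z \sim N(0, 1)$, for all $\lambda > 0$,
$$\Pr(Z \ge \lambda) = \Pr(e^{tZ} \ge e^{t\lambda}) \le e^{-t\lambda}\,\mathbb{E}[e^{tZ}] = e^{-t\lambda + t^2/2} = e^{-\lambda^2/2}$$
by taking $t = \lambda$.

This is pretty tight: it turns out that we're only losing a factor on the order of $\lambda$, and in reality we actually have
$$\Pr(Z \ge \lambda) \sim \frac{e^{-\lambda^2/2}}{\sqrt{2\pi}\,\lambda}$$
as $\lambda \to \infty$.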
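Both statements about the Gaussian tail can be checked numerically with the complementary error function from the standard library. The helper names below are illustrative assumptions.

```python
from math import erfc, exp, pi, sqrt

def gaussian_tail(lam):
    """Pr(Z >= lam) for Z ~ N(0, 1), via the complementary error function."""
    return 0.5 * erfc(lam / sqrt(2))

def chernoff_bound(lam):
    """The moment generating function bound exp(-lam^2 / 2)."""
    return exp(-lam**2 / 2)

def sharp_asymptotic(lam):
    """The asymptotic form exp(-lam^2 / 2) / (sqrt(2 pi) * lam)."""
    return exp(-lam**2 / 2) / (sqrt(2 * pi) * lam)

for lam in (1, 2, 5):
    print(lam, gaussian_tail(lam), chernoff_bound(lam), sharp_asymptotic(lam))
```

The ratio of the true tail to the asymptotic form approaches 1 from below as $\lambda$ grows, while the ratio to the Chernoff bound decays like a constant over $\lambda$.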


See Appendix A of the textbook for different instantiations of the Chernoff bound. Similarly, we can find exponential decay for Bernoulli variables where $p \ne \frac{1}{2}$:

Fact 5.5

If $Y$ is a sum of independent Bernoulli variables (with not necessarily the same probability), then for all $\varepsilon > 0$,
$$\Pr\left(|Y - \mathbb{E}[Y]| \ge \varepsilon\,\mathbb{E}[Y]\right) \le 2e^{-C_\varepsilon \mathbb{E}[Y]}$$
for some constant $C_\varepsilon > 0$.


5.2 An application: discrepancy 


Theorem 5.6

Let $H$ be a $k$-uniform hypergraph with $m$ edges. Then we can color the vertices red and blue so that every edge has an $O(\sqrt{k\log m})$ difference in the number of red and blue vertices.

Proof. Color each vertex uniformly at random: put $\pm 1$ on every vertex. Then the sum of the labels in each edge is of the form $S_k = X_1 + \cdots + X_k$ where all $X_i = \pm 1$, so by the Chernoff bound, the probability that $|S_k|$ exceeds $\lambda\sqrt{k}$ is at most $2e^{-\lambda^2/2}$. Note that the absolute value of $S_k$ is exactly the difference between the number of red and blue vertices.

In particular, we can now do a union bound: if $2me^{-\lambda^2/2} < 1$, then there exists a coloring where none of the bad events happen. This holds once $\lambda > \sqrt{2\log(2m)}$, which gives the desired result.
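The proof is effectively an algorithm: sample colorings until one works. The sketch below makes that concrete under assumptions introduced here: the function names, the seeds, and the random test hypergraph are hypothetical, and the threshold uses $\lambda = \sqrt{2\log(4m)}$ so that the union bound gives failure probability at most $\frac{1}{2}$ per attempt.

```python
import random
from math import log, sqrt

def discrepancy(edges, coloring):
    """Largest |#red - #blue| over all edges, with colors encoded as +1 / -1."""
    return max(abs(sum(coloring[v] for v in edge)) for edge in edges)

def low_discrepancy_coloring(num_vertices, edges, k, seed=0, max_tries=1000):
    """Resample uniform colorings until every edge has imbalance <= sqrt(2 k log(4m))."""
    m = len(edges)
    # lam = sqrt(2 log(4m)) makes the union bound 2 m exp(-lam^2 / 2) equal to 1/2,
    # so each attempt succeeds with probability at least 1/2.
    bound = sqrt(2 * k * log(4 * m))
    rng = random.Random(seed)
    for _ in range(max_tries):
        coloring = [rng.choice((-1, 1)) for _ in range(num_vertices)]
        if discrepancy(edges, coloring) <= bound:
            return coloring, bound
    raise RuntimeError("no coloring found (should be extremely unlikely)")

# Hypothetical instance: 40 random edges of size 16 on 60 vertices.
gen = random.Random(1)
edges = [gen.sample(range(60), 16) for _ in range(40)]
coloring, bound = low_discrepancy_coloring(60, edges, k=16)
print(discrepancy(edges, coloring), "<=", bound)
```

In practice a random coloring usually does much better than the guaranteed bound, since the union bound is pessimistic.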


This kind of log term usually comes from the Chernoff bound. If we only used the second moment method, we'd have a much worse result: the tail bound would decay polynomially instead of exponentially, so the union bound would need $\lambda$ on the order of $\sqrt{m}$.

Well, what's the truth? Suppose $m = k$: this theorem gives us a difference of $O(\sqrt{k\log k})$ between the red and blue vertices. But we can do much better:

Fact 5.7

Spencer's paper “Six standard deviations suffice” says that when $m = k$, we can get at most $6\sqrt{k}$ difference between the number of red and blue vertices on every edge.


5.3 Chromatic number and graph minors

Let's start with some graph theory results for motivation:

Proposition 5.8 (Kuratowski's theorem)

If $G$ is not planar, then it contains a $K_{3,3}$ or $K_5$ subdivision.

Here, a subdivision of a graph $H$ is $H$ with some of the edges chopped into smaller pieces. Basically, $K_5$s and $K_{3,3}$s are not allowed in a planar graph, nor are those graphs with extra vertices along the edges. There's another similar theorem that is actually equivalent to Kuratowski's theorem:

Proposition 5.9 (Wagner's theorem)

If $G$ is not planar, then $G$ contains a $K_{3,3}$- or $K_5$-minor.

Here, $H$ is a minor of $G$ if it can be obtained from $G$ by deleting edges and vertices and contracting edges. (Contracting an edge basically means taking its two vertices and squishing them together.) In particular, $K_5$ is a minor of a $K_5$-subdivision.


Theorem 5.10 (Four-color theorem)
If $\chi(G) \ge 5$, then $G$ is not planar.

In particular, if $\chi(G) \ge 5$, it must contain a $K_{3,3}$-minor or a $K_5$-minor. Having a $K_5$-minor seems pretty relevant, since we need 5 colors to color a $K_5$. But $K_{3,3}$ doesn't seem like as much of an obstruction, and that's quantified in the statement below:

Fact 5.11
If $\chi(G) \ge 5$, then $G$ contains a $K_5$-minor.

Well, does this hold if we replace 5 with other numbers?

Conjecture 5.12 (Hadwiger's conjecture)

If $\chi(G) \ge t$, then $G$ contains a $K_t$-minor.

Many people consider this to be the biggest open problem in graph theory! We do have some small cases resolved: $t = 1, 2$ are trivial. $t = 3$ is not too hard: if $G$ has no $K_3$-minor, it is a forest, which is 2-colorable. $t = 4$ requires more work but is elementary, and $t = 5$ is equivalent to the four-color theorem (for which we only have a computer-assisted proof). Robertson, Seymour, and Thomas showed that the four-color theorem also implies the case $t = 6$, and all $t \ge 7$ are open.

Are there variations on this conjecture?


Are there variations on this conjecture? 
Proposition 5.13 (Hajós conjecture: even stronger)

If $\chi(G) \ge t$, then $G$ has a $K_t$-subdivision.

Unfortunately, this is false. In fact, by the probabilistic method, Erdős and Fajtlowicz showed that $G(n, \frac{1}{2})$ fails this condition with high probability:

Theorem 5.14

With high probability, $G(n, \frac{1}{2})$ has chromatic number $\chi(G) \ge (1 + o(1))\frac{n}{2\log_2 n}$ and no $K_{\lceil 10\sqrt{n} \rceil}$-subdivision.

So the conjecture is very false in the relation between the two parameters, as well as in its likelihood! Note that the Hajós conjecture is still true for small $t$: it just fails for larger $t$ due to the arguments below.


Proof. We already lower bounded the chromatic number by upper bounding the color classes, which are independent sets:
$$\chi(G) \ge \frac{n}{\alpha(G)} \sim \frac{n}{2\log_2 n}$$
with high probability. Let's work on the second part.

Suppose we have a $K_t$-subdivision, where $t = \lceil 10\sqrt{n} \rceil$. Out of the $\binom{t}{2}$ edges in $K_t$, about half of them are not contained in $G$, so they must use up other vertices to form paths, and we don't have enough of those.

Let's do this more rigorously. Let $G$ have a $K_t$-subdivision whose $t$ branch vertices form a set $S \subseteq V$, where $|S| = t = \lceil 10\sqrt{n} \rceil$. At most $n$ edges of $K_t$ can be represented in the subdivision by paths of at least 2 edges (rather than just single edges between vertices), since each such path takes up an external vertex, and all paths use distinct vertices by definition. So the number of edges $E$ of $G$ inside $S$ satisfies
$$E \ge \binom{t}{2} - n \ge \frac{3}{4}\binom{t}{2},$$
where the second $\ge$ comes from picking a large enough constant in $t = c\sqrt{n}$ (we chose $c = 10$). But this inequality fails with high probability, since we're supposed to have about $\frac{1}{2}\binom{t}{2}$ edges only. Indeed, for every fixed $t$-vertex subset $S$, each edge appears with probability $\frac{1}{2}$, so the number of edges in the subgraph induced by $S$ satisfies
$$\Pr\left(E \ge \frac{3}{4}\binom{t}{2}\right) \le e^{-t^2/10}$$
by the Chernoff bound. Now by a union bound, ranging over all $t$-element subsets of vertices, the probability of any subdivision being possible is bounded above by
$$\binom{n}{t} e^{-t^2/10} \le n^t e^{-t^2/10} = o(1),$$
so with high probability there is no $K_t$-subdivision.
6 The Lovász local lemma

The local lemma is an extremely important tool, and we saw a basic use of it early on when finding bounds for diagonal Ramsey numbers. Recall that the vanilla way to do this (a union bound) doesn't use the fact that there are few local dependencies between our “bad events.”

Most of our proofs in the previous few sections worked with high probability, but now we are shifting gears: this method gets us a small but nonzero probability of success.

Theorem 6.1 (Local lemma, symmetric case)

Given events $A_1, \cdots, A_n$, where all $\Pr(A_i) \le p$, if each $A_i$ is mutually independent of the set of all other $A_j$ except at most $d$ of them, and
$$ep(d + 1) \le 1,$$
then with positive probability, none of the events occur.

($e$ is the best constant we can have here.) If the probabilities are small, for instance if all $\Pr(A_i) < \frac{1}{n}$, union bounding tells us that we have a scenario where none of the events happen. Alternatively, if the $A_i$s were independent, we could just multiply the non-event probabilities. So this is a sort of happy medium between the two extremes!


Definition 6.2
An event $A_0$ is mutually independent of $\{A_1, \cdots, A_m\}$ if $A_0$ is independent of every event of the form $B_1 \cap B_2 \cap \cdots \cap B_m$, where each $B_i = A_i$ or $\overline{A_i}$.

In other words, any boolean information about $A_1, \cdots, A_m$ tells us nothing about $A_0$. As a warning, this is different from saying that $A_0$ and $A_i$ are independent for all $i$, since events can be pairwise independent but not mutually independent. (For example, consider the three variables $X_1$, $X_2$, $X_1 + X_2 \bmod 2$, where $X_1$ and $X_2$ are uniform among $\{0, 1\}$.)
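The parenthetical example is small enough to verify by enumerating the four equally likely outcomes. The code below is an illustrative check (names are chosen here, not taken from the notes).

```python
from fractions import Fraction
from itertools import product

# The four equally likely outcomes (X1, X2, X1 + X2 mod 2).
outcomes = [(x1, x2, (x1 + x2) % 2) for x1, x2 in product((0, 1), repeat=2)]

def prob(event):
    """Probability of an event, given as a predicate on outcomes."""
    return Fraction(sum(1 for w in outcomes if event(w)), len(outcomes))

def independent(i, j):
    """Check Pr(X_i = a, X_j = b) = Pr(X_i = a) Pr(X_j = b) for all bits a, b."""
    return all(
        prob(lambda w: w[i] == a and w[j] == b)
        == prob(lambda w: w[i] == a) * prob(lambda w: w[j] == b)
        for a, b in product((0, 1), repeat=2)
    )

pairwise = all(independent(i, j) for i, j in [(0, 1), (0, 2), (1, 2)])
joint = prob(lambda w: w == (1, 1, 1))
product_of_marginals = (
    prob(lambda w: w[0] == 1) * prob(lambda w: w[1] == 1) * prob(lambda w: w[2] == 1)
)
print(pairwise, joint, product_of_marginals)
```

Every pair is independent, yet the three variables can never all equal 1, so the third variable is not mutually independent of the first two.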

Almost all applications of the local lemma are of the following form: we have a set of independent random variables $\Omega$, and each $A_i$ depends on a subset $S_i \subseteq \Omega$, so two events $A_i$ and $A_j$ are independent if $S_i \cap S_j = \emptyset$.


Let's start by looking at a few applications before returning to the proof of the theorem. 


6.1 Coloring: hypergraphs and real numbers 


Let's first consider a hypergraph: we wish to color all vertices red and blue so that no edge is monochromatic. 


Theorem 6.3

A $k$-uniform hypergraph $G$ is 2-colorable if every edge intersects at most $d = \frac{2^{k-1}}{e} - 1$ other edges.

Proof. Color the vertices uniformly at random. For each edge $f$, let $A_f$ be the event that $f$ is monochromatic: each occurs with probability $2^{-k+1} = p$.

Each bad event $A_f$ is mutually independent of all other events $A_{f'}$ for which $f$ and $f'$ do not share any vertices. (Note that here our set of independent random variables $\Omega$ is indexed by the vertices of $G$.) By the Lovász local lemma, since $ep(d + 1) \le 1$, with positive probability the random coloring is proper, so a proper coloring exists.


What are some consequences of this? 
Question 6.4. For which $k$ is every $k$-uniform, $k$-regular hypergraph 2-colorable?

A triangle is not 2-colorable, and the Fano plane is not 2-colorable, so our statement is false for $k = 2, 3$. Well, if we apply the theorem we've just proved, every edge intersects at most $k(k - 1)$ other edges, so if
$$k(k - 1) \le \frac{2^{k-1}}{e} - 1,$$
then the statement is true. It turns out that is good enough for $k \ge 9$. What about $4 \le k \le 8$? It's known that the statement is actually true for all $k \ge 4$, but that is much harder to prove.
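The cutoff at $k = 9$ comes straight from plugging $p = 2^{-k+1}$ and $d = k(k - 1)$ into the symmetric local lemma. A quick check, with helper names that are assumptions of this sketch:

```python
from math import e

def lll_symmetric_ok(p, d):
    """Hypothesis of the symmetric local lemma: e * p * (d + 1) <= 1."""
    return e * p * (d + 1) <= 1

def two_colorable_by_lll(k):
    """k-uniform, k-regular case: p = 2^(1 - k), and each edge meets at most k(k - 1) others."""
    return lll_symmetric_ok(2.0 ** (1 - k), k * (k - 1))

for k in range(2, 13):
    print(k, two_colorable_by_lll(k))
```

The condition fails for every $k \le 8$ and holds from $k = 9$ onward, since $2^{k-1}$ eventually dominates $k(k - 1)$.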
Let's ask a related question now: say we're looking at $k$-colorings of the real numbers. Basically, we have some function that assigns one of the first $k$ positive integers to each real number:
$$c : \mathbb{R} \to [k],$$
or we can think of our domain as $\mathbb{Z}$ instead if the Axiom of Choice is annoying to deal with. Say that a subset $T \subseteq \mathbb{R}$ is multicolored with respect to $c$ if all $k$ colors appear in $c(T)$.


Theorem 6.5
Fix $k$. Given a subset $S$ of the reals with $|S| = m$, we can color the real numbers with $k$ colors so that every translate of $S$ is multicolored if
$$e\left(m(m - 1) + 1\right) k \left(1 - \frac{1}{k}\right)^m \le 1.$$

Doing some calculations (omitted), this condition holds when $m > (3 + \varepsilon)k\log k$ for sufficiently large $k$. This is a hypergraph coloring problem on an infinite vertex set, but we can still use the Lovász local lemma to solve it.


Proof. First, we'll show that for every finite subset $X \subseteq \mathbb{R}$, we can color $X$ such that every translate $x + S \subseteq X$ is multicolored. Color $X$ uniformly at random among the $k$ colors. Our “bad events” correspond to elements $x$ with $x + S \subseteq X$ where $x + S$ is not multicolored, so we can think of having a hypergraph where the vertices are elements of $X$ and the edges correspond to translates. Each fixed translate $x + S \subseteq X$ is not multicolored with probability (union bound, pick a color to not include)
$$p \le k\left(1 - \frac{1}{k}\right)^m.$$
Furthermore, each translate of $S$ intersects at most $m(m - 1)$ other translates (pick a pair $a \in S$, $a' \in S'$ to overlap). So now by the local lemma, there exists a good coloring for every finite set as long as the $ep(d + 1) \le 1$ condition is satisfied.

But how do we extend this to the reals? We use compactness: Tikhonov's theorem says that if we give the $k$ colors the discrete topology, then $[k]^{\mathbb{R}}$ is compact. A point in $[k]^{\mathbb{R}}$ is a map from the reals to 1 through $k$, which is basically a coloring.

So for every $x \in \mathbb{R}$, let $C_x \subseteq [k]^{\mathbb{R}}$ be the subset of colorings for which $x + S$ is multicolored. Our goal is to show that we can make all of the translates multicolored at once. We've shown so far that every finite intersection of $C_x$s is nonempty by the local lemma. Under the product topology, $C_x$ is closed for all $x \in \mathbb{R}$ (membership in $C_x$ only depends on the colors of the finitely many reals in $x + S$), so the intersection of all these closed sets
$$\bigcap_{x \in \mathbb{R}} C_x$$
is nonempty as well, and any element of it is a coloring of all the real numbers with every translate multicolored.
The logic here is “we have a bunch of closed sets in a compact space, any finite intersection is nonempty, so the infinite intersection is nonempty.” To elaborate on this, think of $\mathbb{Z}$ instead of $\mathbb{R}$. Then this is a diagonalization argument: if we can color every prefix of the positive integers, then we can write down a valid coloring for each prefix. Some color for 1 appears in infinitely many of these colorings; among those, some color for 2 appears infinitely often, and so on.


6.2 Coverings of $\mathbb{R}^3$

Here's a motivating fact that we won't prove:

Fact 6.6 (Mani-Levitska, Pach)

For every $k \ge 1$, there exists a nondecomposable $k$-fold covering of $\mathbb{R}^3$ by open unit balls. Here, a $k$-fold covering means that every point in $\mathbb{R}^3$ is covered at least $k$ times. A covering is decomposable if it can be partitioned into two coverings.

(This generalizes beyond 3 dimensions, by the way.) Maybe all the points are covered a roughly uniform number of times: can we say that there aren't outliers in such a covering? The answer is no:

Theorem 6.7

There exists an absolute constant $c > 0$ such that every nondecomposable $k$-fold covering of $\mathbb{R}^3$ by open unit balls must cover some point at least $c\,2^{k/3}$ times.


Proof. If we try to decompose a covering, we're coloring the unit balls red and blue so that each color forms a covering of $\mathbb{R}^3$ on its own. Our “bad events” here are where some $x \in \mathbb{R}^3$ is only covered by balls of one color: if no bad events occur, then we have a successful decomposition.

Construct an infinite hypergraph $H$ where the vertices are the balls of the covering $F$ and the edges are
$$E_x = \text{the set of balls in } F \text{ containing } x.$$
So the edges are
$$E(H) = \{E_x : x \in \mathbb{R}^3\},$$
where two points in the same “cell” correspond to the same edge. This means decomposable is the same as 2-colorable in our hypergraph.

For the sake of contradiction, assume every point is covered at most $t$ times, where $t$ is to be determined. By a compactness argument, it suffices to show that every finite subgraph of $H$ is 2-colorable.

We claim that every edge intersects $\lesssim t^3$ other edges. Two edges intersect if they share at least one ball in common. Fix $E_x$, and consider any $E_y$ that intersects it: every ball in such an $E_y$ lies within the ball of radius 4 centered at $x$. Since every point is covered at most $t$ times, comparing volumes shows there are at most $4^3 t$ such balls. It turns out that $m$ balls can cut $\mathbb{R}^3$ into at most $m^3 + 1$ regions, and each region corresponds to an edge. So there are at most $4^9 t^3 + 1$ other edges that any edge can intersect.

So now the rest is just applying the Lovász local lemma. Each edge is monochromatic with probability at most $2^{-k+1}$ (since every point is covered at least $k$ times), and in the dependency graph, every degree is $\lesssim t^3$. So the condition $ep(d + 1) \le 1$ of the local lemma holds if
$$t \le c\,2^{k/3}$$
for some sufficiently small constant $c$. So for $t \le c\,2^{k/3}$ the covering is decomposable by the local lemma, and therefore every nondecomposable $k$-fold covering must cover some point $\gtrsim 2^{k/3}$ times.

By the way, to prove the claim that $m$ balls can cut $\mathbb{R}^3$ into at most $m^3 + 1$ regions, we use induction. Adding a new ball creates new regions only where it intersects other balls: the number of new regions is at most the number of regions on its sphere cut out by $m$ circles, and we can make arguments from there.


6.3 The general local lemma and proof 


As previously stated, we have a bunch of bad events $A_1, \cdots, A_n$ that we are trying to avoid. Suppose that each $A_i$ is mutually independent of all other $A_j$ except for those with $j$ in some set $N(i)$. (Notice that $N(i)$ does not include $i$.)

Definition 6.8

The dependency graph is constructed by having a vertex for each bad event $A_i$ and joining $i$ to each element of the set $N(i)$ (the dependencies of $i$).

This graph is sometimes directed, but in almost all applications, it's sufficient to make it undirected. For example, if we have a hypergraph coloring problem and we're coloring each vertex at random, the dependency graph connects edges that have nonempty vertex intersection. In such a setup, we have the following:


Theorem 6.9 (Local lemma, symmetric form)
If every node in the dependency graph has degree at most $d$, and every event has probability at most $p$, then there is a positive probability that no bad events occur as long as
$$ep(d + 1) \le 1.$$

We're going to be proving a more general form of this:

Theorem 6.10 (Local lemma, general form)

If we have real numbers $x_1, \cdots, x_n \in [0, 1)$ such that for all $i$,
$$\Pr(A_i) \le x_i \prod_{j \in N(i)} (1 - x_j),$$
then with probability at least $\prod_{i=1}^{n} (1 - x_i)$, no event $A_i$ occurs.

Here, note that $x_i$ is not the probability of $A_i$: it's something at least as large as $\Pr(A_i)$, which gets weighted down by the other factors.


By the way, notice that if we can find values $x_i$ to plug in, we can get the symmetric case from the general case:

Deducing the symmetric case. Set all $x_i = \frac{1}{d + 1} < 1$. Then notice that
$$x_i \prod_{j \in N(i)} (1 - x_j) = \frac{1}{d + 1}\left(1 - \frac{1}{d + 1}\right)^{|N(i)|} \ge \frac{1}{d + 1}\left(1 - \frac{1}{d + 1}\right)^{d} \ge \frac{1}{(d + 1)e},$$
and if we have the hypothesis of the symmetric case of the lemma, then
$$x_i \prod_{j \in N(i)} (1 - x_j) \ge p \ge \Pr(A_i),$$
and therefore the general case's hypothesis must hold as well.
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Here's another way to specialize the general form: 


Corollary 6.11
If all events have probability $\Pr(A_i) < \frac{1}{2}$, and

$$\sum_{j \in N(i)} \Pr(A_j) \le \frac{1}{4}$$

for all $i$, then there is a positive probability that no $A_i$ holds.


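Proof. Set $x_i = 2\Pr(A_i)$; this is always less than 1. Since $\prod_j (1 - x_j) \ge 1 - \sum_j x_j$, the product satisfies

$$x_i \prod_{j \in N(i)} (1 - x_j) \ge 2\Pr(A_i)\left(1 - \sum_{j \in N(i)} 2\Pr(A_j)\right) \ge 2\Pr(A_i) \cdot \frac{1}{2} = \Pr(A_i),$$

so every probability is at most its corresponding product, and we can use the theorem in its general form.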
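The two specializations above can be checked numerically. The following is a minimal sketch, not part of the original notes: the function name `lll_success_bound` and the 5-cycle instance are assumptions made purely for illustration. It tests the hypothesis of the general form for a given choice of weights and, when it holds, returns the guaranteed lower bound $\prod_i (1 - x_i)$.

```python
import math

def lll_success_bound(probs, neighbors, xs):
    """Return prod(1 - x_i) if Pr(A_i) <= x_i * prod_{j in N(i)} (1 - x_j)
    holds for every i, and None if the hypothesis fails for some i."""
    for i, p in enumerate(probs):
        slack = xs[i] * math.prod(1 - xs[j] for j in neighbors[i])
        if p > slack:
            return None
    return math.prod(1 - x for x in xs)

# Hypothetical instance: 5 events whose dependency graph is a 5-cycle (d = 2).
n, d = 5, 2
neighbors = [[(i - 1) % n, (i + 1) % n] for i in range(n)]
p = 1 / (math.e * (d + 1))      # largest p allowed by e * p * (d + 1) <= 1
probs = [p] * n

# Weights from the symmetric case: x_i = 1 / (d + 1).
symmetric = lll_success_bound(probs, neighbors, [1 / (d + 1)] * n)
# Weights from Corollary 6.11: x_i = 2 * Pr(A_i).
doubled = lll_success_bound(probs, neighbors, [2 * q for q in probs])
# A probability that is too large for the symmetric weights.
too_large = lll_success_bound([0.2] * n, neighbors, [1 / (d + 1)] * n)
```

Different weight choices can certify the same instance with different lower bounds, which is one reason the general form is stated with free parameters $x_i$.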


We'll now present the original proof of the local lemma from its publication — it uses induction. 


Proof. Say we have $n$ events. Let $S$ be a subset of $[n]$ which indexes some of our bad events: we'll induct on $|S|$. The induction hypothesis is the following:

Proposition 6.12

If $i \notin S$, then

$$\Pr\left(A_i \,\middle|\, \bigwedge_{j \in S} \overline{A_j}\right) \le x_i.$$

Basically, this is the probability that $A_i$ occurs, conditioned on the fact that none of the events indexed by $S$ occur.


If we prove this, then by Bayes' formula, the probability that none of the events occur is

$$\Pr(\overline{A_1}) \cdot \Pr(\overline{A_2} \mid \overline{A_1}) \cdots \Pr(\overline{A_n} \mid \overline{A_1} \cdots \overline{A_{n-1}}) \ge (1 - x_1)(1 - x_2) \cdots (1 - x_n),$$

where the inequality comes from applying the proposition to each factor. So this would imply the local lemma.


First of all, this proposition is easy to show when $S$ is empty, since the probability that $A_i$ occurs is at most

$$x_i \prod_{j \in N(i)} (1 - x_j) \le x_i.$$

Now for the inductive step, $i$ may have some neighbors in $S$: call this set $S_1 = S \cap N(i)$, and call everything else $S_2 = S \setminus S_1$.
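Let's understand the conditional probability $\Pr\left(A_i \mid \bigwedge_{j \in S} \overline{A_j}\right)$. We can separate this into contributions from $S_1$ and contributions from $S_2$ using Bayes' rule:

$$\Pr\left(A_i \,\middle|\, \bigwedge_{j \in S} \overline{A_j}\right) = \frac{\Pr\left(A_i \wedge \bigwedge_{j \in S_1} \overline{A_j} \,\middle|\, \bigwedge_{j \in S_2} \overline{A_j}\right)}{\Pr\left(\bigwedge_{j \in S_1} \overline{A_j} \,\middle|\, \bigwedge_{j \in S_2} \overline{A_j}\right)}.$$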


First, we can upper-bound the numerator by forgetting the dependencies that are hard to control: this is at most

$$\Pr\left(A_i \,\middle|\, \bigwedge_{j \in S_2} \overline{A_j}\right),$$

since we're just removing some conditions on what needs to happen. But now since $A_i$ is mutually independent of all of its non-neighbors, and $S_2$ contains no neighbors of $i$, this is just $\Pr(A_i) \le x_i \prod_{j \in N(i)} (1 - x_j)$ by the assumed conditions.
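Meanwhile, we can also lower bound the denominator. Label the elements of $S_1 = \{j_1, \dots, j_r\}$; as before, we can write the denominator as a product of conditional probabilities

$$\Pr\left(\overline{A_{j_1}} \,\middle|\, \bigwedge_{j \in S_2} \overline{A_j}\right) \Pr\left(\overline{A_{j_2}} \,\middle|\, \overline{A_{j_1}} \wedge \bigwedge_{j \in S_2} \overline{A_j}\right) \cdots \Pr\left(\overline{A_{j_r}} \,\middle|\, \overline{A_{j_1}} \wedge \cdots \wedge \overline{A_{j_{r-1}}} \wedge \bigwedge_{j \in S_2} \overline{A_j}\right).$$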
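Each conditioning set here is a smaller set than $S$ (or else $S$ contains no neighbors of $i$ at all, in which case the denominator is 1 and this is easy). So by the induction hypothesis, each event $A_{j_t}$ occurs with conditional probability at most $x_{j_t}$, so the denominator is at least

$$(1 - x_{j_1})(1 - x_{j_2}) \cdots (1 - x_{j_r}) \ge \prod_{j \in N(i)} (1 - x_j)$$

(since the factors of the LHS product are a subset of the factors of the RHS product, and every factor is at most 1). Putting the numerator and denominator together, we're done! The probability that $A_i$ occurs given that none of the events in $S$ occurs is at most

$$\frac{x_i \prod_{j \in N(i)} (1 - x_j)}{\prod_{j \in N(i)} (1 - x_j)} = x_i,$$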


as desired, completing the inductive step. 


6.4 The Moser-Tardos algorithm 


We now know that under certain circumstances, it is possible to avoid all of our bad events. But is there a nice way 
to algorithmically determine an example of this? It’s not even obvious how to do a randomized algorithm, since the 
chance of success (avoiding all bad events) is generally so small. 

Consider a random variable model to make our problem less abstract: we have a collection of independent random variables, where each event $A_i$ depends only on some subset of the variables. We have a dependency graph, where $A \sim B$ if $A$ and $B$ share common variables. Our goal is to find an assignment of the independent variables so that none of the $A_i$ occur.


It would seem like this algorithm could be very sophisticated, but here’s one that isn't! 


Algorithm 6.13 


First, initialize all our random variables with independent random values. While some bad event $A_i$ occurs (pick one arbitrarily), resample all of the variables that $A_i$ depends on (there are only finitely many). Maybe this induces some other bad events: we just keep running the while loop.
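As an illustration, here is a minimal sketch of the resampling loop for hypergraph 2-coloring, where the variables are the vertex colors and the bad events are monochromatic edges. The function name, the `max_rounds` safeguard, and the small example hypergraph are assumptions made for this sketch and are not from the original notes.

```python
import random

def moser_tardos_two_coloring(num_vertices, edges, rng=None, max_rounds=100_000):
    """Resample monochromatic edges until none remain.
    Returns (coloring, number_of_resamplings)."""
    if rng is None:
        rng = random.Random(0)
    # Initialize every variable (vertex color) independently at random.
    color = [rng.randrange(2) for _ in range(num_vertices)]
    for resamples in range(max_rounds):
        # Pick an arbitrary bad event: the first monochromatic edge, if any.
        bad = next((e for e in edges if len({color[v] for v in e}) == 1), None)
        if bad is None:
            return color, resamples
        # Resample exactly the variables that this bad event depends on.
        for v in bad:
            color[v] = rng.randrange(2)
    raise RuntimeError("gave up after max_rounds resamplings")

# Hypothetical 4-uniform hypergraph in which each edge meets at most one other.
edges = [(0, 1, 2, 3), (3, 4, 5, 6), (7, 8, 9, 10), (10, 11, 12, 13)]
coloring, rounds = moser_tardos_two_coloring(14, edges)
```

The loop never looks at the weights $x_i$; they only enter the analysis of how many resamplings to expect.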


We might be worried that this algorithm might never terminate, or that on average it terminates only after an exponential number of iterations. It turns out that under the local lemma conditions, neither is a concern:


Theorem 6.14 (Moser—Tardos) 


If the Lovász local lemma conditions hold, then the algorithm is expected to resample each $A_i$ at most $\frac{x_i}{1 - x_i}$ times. In particular, the expected number of iterations of the while loop is at most

$$\sum_{i=1}^n \frac{x_i}{1 - x_i}.$$

Note that this algorithm is agnostic of the $x_i$s: it doesn't change based on the parameters that we're using.

There are two key concepts in this proof that are needed. We need an execution log, which is a list of the 
resampled events at each step (so we add on 1 event per run through the while loop). 

We also need a witness tree, which is a finite rooted tree labeled by events such that the children of a node labeled $i$ have distinct labels, all of which lie in $N(i) \cup \{i\}$ (the neighbors of $i$ and $i$ itself).

Why is this called a witness tree? We're going to use a prefix of our execution log to make a tree (though the 
lumberjack may say to go the other way). 

Basically, given a prefix of the execution log, read the log right to left. The rightmost event is the root of the tree, and for each subsequent event $A$, if some node in the tree overlaps with $A$ (in other words, its label lies in $N(A) \cup \{A\}$), then add $A$ as a child of a deepest such node (chosen arbitrarily). If nothing in the tree is dependent with $A$, just discard it.

The key idea here is that this witness tree tracks some of the possible ways we could have progressed in our 


algorithm from bottom to top. 
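The construction just described can be sketched in a few lines. This is only an illustration: the function name `build_witness_tree`, the tuple representation of nodes, and the small path-shaped dependency graph A - B - C (with D isolated) are assumptions chosen for the sketch, not the graph used in the notes.

```python
def build_witness_tree(log_prefix, adjacent):
    """Build a witness tree from a prefix of the execution log.

    log_prefix: sequence of event labels, oldest first.
    adjacent(a, b): True if a == b or a and b are neighbors.
    Returns a list of nodes (label, parent_index, depth); node 0 is the root.
    """
    *earlier, root = log_prefix
    nodes = [(root, None, 0)]
    # Read the rest of the log from right to left.
    for event in reversed(earlier):
        candidates = [i for i, (label, _, _) in enumerate(nodes)
                      if adjacent(label, event)]
        if not candidates:
            continue  # nothing in the tree overlaps with this event: discard it
        # Attach below a deepest overlapping node (ties broken by insertion order).
        parent = max(candidates, key=lambda i: nodes[i][2])
        nodes.append((event, parent, nodes[parent][2] + 1))
    return nodes

# Hypothetical dependency graph: a path A - B - C, with D isolated.
graph_edges = {frozenset("AB"), frozenset("BC")}

def adjacent(a, b):
    return a == b or frozenset((a, b)) in graph_edges

tree = build_witness_tree("ABDAC", adjacent)
```

In this run the root is C; the A and D immediately before it are discarded because neither overlaps with C, then B becomes a child of C, and the first A becomes a child of B.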


Example 6.15 


Let's say we have the dependency graph

[Figure: dependency graph on the events A, B, C, D, E, F]

and we have an execution log with prefix FBDEADBFDC.

When we make our rooted tree, C is the root. D is next-to-last, but it couldn't have caused C to be resampled since C and D are not overlapping, so it is discarded. F is adjacent to C, though, so F is now a child of C. This represents the statement “C could have been resampled as a result of F.”

Now B must be another child of C for the same reason. The next D is discarded, A is a child of B, E is a child of B or F (arbitrarily choose F), and so on.


Events at the same depth are independent, because if they weren't, one would be forced to be a child of the other 


when it is inserted. 


Lemma 6.16 


For a given log, all prefixes produce distinct trees. 


This is not too hard to show. Given any tree with root A, the number of times A appears in it is just the number 
of As up to that point in the prefix of the log. So all such trees must have a different number of As, and all trees with 


root A are different from trees of root B, and so on. 


Lemma 6.17 


For a given witness tree $T$, the probability that $T$ appears as a witness tree of some prefix is at most

$$\prod_{v \in T} \Pr(A_{[v]}),$$

where $A_{[v]}$ is the event corresponding to the node $v$ in the witness tree.


To show this, we'll introduce some randomness.

Definition 6.18

Define a simulation of $T$ to be the following process: visit all nodes of $T$ in order of decreasing depth (the order doesn't matter within each given level), and resample all the variables of the corresponding event at each visited node. The simulation succeeds if all bad events encountered do occur.


Proof of lemma. How can we ensure that there's a level of consistency between the tree and simulation? We're going to do a coupling of the two, where we basically combine two processes $X$ and $Y$ into a joint process $(X, Y)$ so that they act in parallel. This way, if we look at just $X$ or $Y$ alone, it has the same distribution as before, but $X$ and $Y$ aren't necessarily independent.

Our goal is to do this in such a way that the simulation always succeeds if $T$ appears as a witness tree. Then, the probability that $T$ appears must be at most the probability that the simulation succeeds, which gives us the bound we want.

What's the probability the simulation succeeds? This is easy: each step in the tree resampling is independent, so this is just

$$\prod_{v \in T} \Pr(A_{[v]}).$$
To make this useful, we need to find a common source of randomness. Let's say we have access to an infinite list of 
realizations of each random variable, and resampling is just taking the next realization of the list. The key idea is to 
use the same list to run both processes. We claim that if T appears as a witness tree, the simulation must succeed. 
Why? During the execution step, there’s an initialization of all variables. Since T appears as a witness tree, we 
know that looking at our execution log, each of the bottom nodes must have been “bad events” that were logged; 


since this is coupled directly to our witness tree, the same bad event must have occurred there. Now keep propagating 


up the tree; we'll notice that each event must have occurred because the prior resampling did not work. 


One last tool we're going to use in our proof is the multitype Galton-Watson branching process. Generate a tree in the following way: specify a root labeled by some event $A_i$, and for each possible child $A_j$ (with $j \in N(i) \cup \{i\}$), keep it with probability $x_j$ (this is not the probability of $A_j$: it is the corresponding real number from the statement of the Lovász local lemma). Repeat the same process for each node; each possible child $A_j$ survives with probability $x_j$ or dies with probability $1 - x_j$.


When this terminates, we get some finite or infinite tree. 


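Lemma 6.19
Let $T$ be a fixed witness tree with root $A_i$. Then $T$ is yielded by the branching process started at $A_i$ with probability

$$p_T = \frac{1 - x_i}{x_i} \prod_{v \in T} x'_{[v]}, \qquad \text{where } x'_{[v]} = x_{[v]} \prod_{j \in N([v])} (1 - x_j).$$

Basically, we're just multiplying the probabilities of each vertex working out, and the normalizing factor comes from the fact that our root is already formed for sure. We can think of this as a weighting of our trees in an information-theoretic manner.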


It's time for us to prove the Moser-Tardos theorem: 


Proof of Theorem 6.14. The number of times an event $A_i$ is resampled is the number of distinct witness trees rooted at $A_i$ that are generated by prefixes of the log (by Lemma 6.16). We expect this to be

$$\sum_{T \text{ rooted at } A_i} \Pr(T \text{ appears as a witness tree}).$$

From our construction by coupling, this is bounded by the expression

$$\sum_{T \text{ rooted at } A_i} \prod_{v \in T} \Pr(A_{[v]}).$$

Now recall that the hypothesis of the Lovász local lemma says $\Pr(A_{[v]}) \le x'_{[v]} = x_{[v]} \prod_{j \in N([v])} (1 - x_j)$, so this is

$$\le \sum_{T \text{ rooted at } A_i} \prod_{v \in T} x'_{[v]}.$$

Now by the lemma above, this is just

$$= \frac{x_i}{1 - x_i} \sum_{T \text{ rooted at } A_i} p_T.$$

But summing over $p_T$ gives the probability that the branching process yields one of these trees, so that sum can be at most 1. Therefore

$$\mathbb{E}[\text{number of times } A_i \text{ is resampled}] \le \frac{x_i}{1 - x_i},$$

and summing over all events $A_i$ gives the desired result.


6.5 A computationally hard example 


Unfortunately, there are instances where we'd like to use the local lemma, but an algorithm may not quickly yield an 


answer. Here’s a simple example of that: 


Example 6.20 
Let $q = 2^k$ be an integer, and let $f$ be a bijection $[q] \to [q]$. Let $y$ be an element of $[q]$; then let $A_i$ be the “bad event” that $f(x)$ and $y$ disagree on the $i$th bit, where we choose $x$ uniformly on the domain $[q]$. (There are $k$ events $A_1, \dots, A_k$, since each element of $[q]$ is described by $k$ bits.)


Notice that all events $A_i$ are mutually independent: $f$ is a bijection, so $f(x)$ is uniform on $[q]$ and its bits behave independently. By the “local lemma” (a very trivial use of it, aka just independent events), there exists an $x$ such that $f(x) = y$.

But algorithmically, can we always find this efficiently? As a foundation of cryptography, the answer is (believed to be) no. A concrete example is the function $f : \mathbb{F}_q \to \mathbb{F}_q$ sending $x \mapsto g^x$ for some generator $g$ (except that $0$ is sent to $0$).

Then we just have the discrete logarithm problem, which is believed to be computationally difficult.
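To make the gap between existence and search concrete, here is a small sketch. It uses the integers modulo a small prime $p$ instead of a field of size $2^k$, purely to keep the arithmetic simple; the names `exp_map` and `invert_by_search` and the parameters $p = 11$, $g = 2$ are illustrative assumptions.

```python
def exp_map(x, g, p):
    """The map x -> g^x mod p on {1, ..., p-1}, extended by 0 -> 0."""
    return 0 if x == 0 else pow(g, x, p)

def invert_by_search(y, g, p):
    """Find x with exp_map(x) == y by trying every x in turn.
    The number of steps grows with p, i.e. exponentially in the bit length."""
    for x in range(p):
        if exp_map(x, g, p) == y:
            return x
    raise ValueError("no preimage: g is not a generator mod p")

p, g = 11, 2                       # 2 generates the nonzero residues mod 11
image = sorted(exp_map(x, g, p) for x in range(p))
x_found = invert_by_search(7, g, p)
```

Evaluating the map is fast (`pow` uses repeated squaring), while the inversion above is plain exhaustive search; no efficient general method for the inverse direction is known.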


6.6 Back to independent sets 


We know that in a graph with maximum degree $\Delta$, we have an independent set of size at least $\frac{|V|}{\Delta + 1}$. (Take a vertex, include it and throw away all of its neighbors, and repeat.) But we can achieve something better with the local lemma:


Theorem 6.21 
Let $G$ be a graph with maximum degree $\Delta$, and let the vertex set $V = V_1 \cup V_2 \cup \cdots \cup V_r$ be partitioned into sets of size $|V_i| \ge 2e\Delta$. Then there exists an independent set with one vertex in each $V_i$.

Then the size of this set is similar to what we have naively (we lose a constant), but we have much more control over what the set looks like!


Proof. First, we may assume all $|V_i| = k = \lceil 2e\Delta \rceil$ for simplicity: we can always toss away extra vertices that we just decide not to use.

For each $V_i$, pick a uniformly random $v_i \in V_i$ independently from all the others: our goal is to show that with positive probability, we get an independent set.

What are our bad events? We don’t want our vertices to be adjacent, and there’s a few ways we can try to 
approach this. 

Attempt 1: Let $A_{ij}$ be the event that $v_i$ and $v_j$ are connected by an edge. Then in the worst case, $\Pr(A_{ij}) \le \frac{\Delta}{k}$, since any given vertex in $V_i$ can only be connected to at most $\Delta$ of the $k$ vertices in $V_j$.

But how large is our dependency set? The events $A_{ij}$ and $A_{i'j'}$ can be dependent when they share one of the parts, and an event such as $A_{ij'}$ has nonzero probability only when some vertex in $V_i$ is adjacent to some vertex in $V_{j'}$. The maximum degree of the dependency graph can therefore be as large as $2\Delta k$: each of the $2k$ vertices in $V_i \cup V_j$ has at most $\Delta$ neighbors. So the condition for the local lemma requires

$$\frac{\Delta}{k} \cdot 2\Delta k = 2\Delta^2$$

to be at most a constant, but this is too large and this method fails.
Attempt 2: So an alternative way to make bad events is to let $A_e$ be the event that both endpoints of $e$ are chosen, where $e \in E(G)$ is an edge between two different parts $V_i, V_j$. We know that

$$\Pr(A_e) = \frac{1}{k^2},$$

since there are $k$ vertices in each of $V_i$ and $V_j$. What do we know about the dependency graph? Again, $A_e$ and $A_f$ can be dependent only if one of the endpoints of $f$ is in $V_i \cup V_j$: at most $2\Delta k$ edges (including $e$ itself) have such an endpoint, so the maximum degree is at most $2\Delta k - 1$. Thus by an application of the local lemma, as long as we have

$$e \cdot \frac{1}{k^2} \cdot 2\Delta k \le 1,$$

we are good, and this holds because $k \ge 2e\Delta$.


So in many applications, it's important to pick the right events to think about. 
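A quick numerical comparison of the two attempts may help here. The sketch below simply evaluates the quantity $e \cdot p \cdot (d + 1)$ for each choice of bad events, with $k = \lceil 2e\Delta \rceil$; the function name `lll_products` is made up for this illustration, and the degree counts are the upper bounds used in the discussion above.

```python
import math

def lll_products(max_degree):
    """Return (k, attempt1, attempt2), where each attempt is e * p * (d + 1)
    for the corresponding choice of bad events, using k = ceil(2 e Delta)."""
    k = math.ceil(2 * math.e * max_degree)
    # Attempt 1: events A_ij with p <= Delta / k and about 2 Delta k dependencies.
    attempt1 = math.e * (max_degree / k) * (2 * max_degree * k)
    # Attempt 2: events A_e with p = 1 / k^2 and d + 1 <= 2 Delta k.
    attempt2 = math.e * (1 / k**2) * (2 * max_degree * k)
    return k, attempt1, attempt2

results = {delta: lll_products(delta) for delta in range(1, 51)}
```

For every $\Delta$ in this range the second product stays at most 1 while the first grows like $2e\Delta^2$, which is exactly why the choice of events matters.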


6.7 Graphs containing large cycles 


Theorem 6.22 


For any k, there exists a d such that every d-regular directed graph has a directed cycle of length divisible by k. 


Here, d-regular means that each vertex has d outgoing and d incoming edges, and a directed cycle is a cycle with 
all arrows pointing correctly. Also, note that if we have a (connected) 2d-regular undirected graph, then we can find 
an Eulerian tour (since all degrees are even). So we can direct an undirected graph. If we have odd degree, we can 
just add a vertex (left as an exercise). Either way, this means that we can prove the undirected version of this theorem 


as well! We'll actually prove something stronger:

Theorem 6.23

Every directed graph with minimum out-degree $\delta$ and maximum in-degree $\Delta$ contains a cycle of length divisible by a constant $k$ if

$$k \le \frac{\delta}{1 + \log(1 + \delta\Delta)}.$$


Proof. We'll simplify the problem: we can delete some edges and assume that all out-degrees are exactly $\delta$. Assign every vertex a uniform random element of $\mathbb{Z}/k\mathbb{Z}$, and only keep those edges where the label increases by 1 mod $k$ along the arrow. In other words, we split up the vertices into $k$ groups, corresponding to the residues. Then if there exists a directed cycle in this new graph, it must have length divisible by $k$.

What can go wrong? It's bad to walk along the edges and get to a dead end, so our goal is for every vertex to have at least 1 outgoing edge that remains. Then we can just keep walking, and since the graph is finite we must eventually revisit a vertex, which closes a cycle.


So let $A_v$ be the event that $v$ doesn't have any outgoing edges. The probability this happens is

$$\left(1 - \frac{1}{k}\right)^{\delta},$$

since all $\delta$ of the outgoing edges must not work. But the dependencies are a bit more subtle: when do we put an edge between $A_v$ and $A_w$? Define $N^+(v)$ to be the out-neighbors of $v$ and $N^+(w)$ to be the out-neighbors of $w$: then it is enough to put an edge whenever

$$(\{v\} \cup N^+(v)) \cap (\{w\} \cup N^+(w)) \ne \emptyset,$$

since the event $A_v$ only depends on the labels of $v$ and its out-neighbors. But we can do a little bit better:


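Lemma 6.24
$A_u$ is mutually independent of all $A_w$ such that

$$N^+(u) \cap (N^+(w) \cup \{w\}) = \emptyset.$$

Proof of lemma. To show this, notice that given a vertex $u$ with some outgoing edges, even if $w$ points to $u$, $A_u$ is independent of $A_w$, because $A_u$ has the same probability even when we fix the outcome of $A_w$. Specifically, if $w$ points to $u$ but is not one of $u$'s out-neighbors and does not point to any of them, we can reveal the labels of $w$ and its out-neighbors (including $u$) first: the labels of the out-neighbors of $u$ are still uniform, so this can't affect the probability of $A_u$.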


So now we can calculate the degrees of the dependency graph: for a fixed $u$, the number of $w$ that are adjacent to $u$ in our dependency digraph is at most $\delta\Delta$. Thus the conditions of the local lemma are satisfied as long as

$$ep(\delta\Delta + 1) = e\left(1 - \frac{1}{k}\right)^{\delta}(\delta\Delta + 1) \le e^{1 - \delta/k}(\delta\Delta + 1) \le 1,$$


and rearranging gives the result. 
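The random labeling in this proof also suggests a simple randomized search, sketched below. This is an illustration rather than the method of the notes: it retries with fresh labels instead of applying the local lemma, and the function name, the retry limit, and the complete digraph used as an example are all assumptions.

```python
import random

def cycle_divisible_by_k(out_neighbors, k, rng=None, max_tries=1000):
    """Look for a directed cycle whose length is divisible by k.

    out_neighbors: dict mapping each vertex to a list of its out-neighbors.
    Returns the list of vertices on the cycle, or None if every try fails.
    """
    if rng is None:
        rng = random.Random(0)
    vertices = list(out_neighbors)
    for _ in range(max_tries):
        label = {v: rng.randrange(k) for v in vertices}
        # Keep only the edges along which the label increases by 1 mod k.
        kept = {v: [w for w in out_neighbors[v] if label[w] == (label[v] + 1) % k]
                for v in vertices}
        if any(not kept[v] for v in vertices):
            continue  # some bad event A_v occurred (a dead end): relabel
        # Walk along kept edges until a vertex repeats; the loop is our cycle.
        path, position = [], {}
        v = vertices[0]
        while v not in position:
            position[v] = len(path)
            path.append(v)
            v = kept[v][0]
        return path[position[v]:]
    return None

# Hypothetical example: the complete digraph on 12 vertices, with k = 3.
complete = {v: [w for w in range(12) if w != v] for v in range(12)}
cycle = cycle_divisible_by_k(complete, 3)
```

Every kept edge raises the label by exactly 1 mod $k$, so returning to the starting vertex forces the number of steps to be a multiple of $k$.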


Fact 6.25 
We've often been saying “the dependency graph,” but the way we should probably think of it is as follows: specify a


graph, and then say that our collection of events is consistent with the graph we've created. Basically, there are 


multiple options for our dependency graph for a single setup.


6.8 Bounds on the linear arboricity conjecture


Lemma 6.26 


Let k < d°°. Then every d-regular directed graph can be vertex-colored (in any way) using k colors so that 
every vertex has $\frac{d}{k} + O\left(\sqrt{\frac{d \log d}{k}}\right)$ in-neighbors of each color (and also simultaneously the same number of out-neighbors).


So looking at our graph from every vertex, the graph looks fairly equidistributed. 


Proof. Color every vertex uniformly at random. Bad events are of the form “a vertex has the wrong number of in-neighbors or out-neighbors of a color”: let $A^+_{v,c}$ be the event “$v$ doesn't have the correct number of out-neighbors of color $c$,” and let $A^-_{v,c}$ be the analogous bad event for in-neighbors.

The probability that each $A^+_{v,c}$ occurs is the probability we deviate too much from the mean:

$$\Pr\left(\left|\mathrm{Binom}\left(d, \frac{1}{k}\right) - \frac{d}{k}\right| > C\sqrt{\frac{d \log d}{k}}\right).$$

We do have to be careful about how we use our Chernoff bound, but we can pick $C$ such that

$$\Pr(A^+_{v,c}) \le \frac{1}{100 d^3}$$

(to be safe in our calculations later). Now, when can $A^+_{v,c}$ and $A^+_{w,c'}$ be dependent? This only happens if they're reasonably close to each other in the graph:

$$(\{v\} \cup N^+(v)) \cap (\{w\} \cup N^+(w)) \ne \emptyset,$$

and the maximum degree that can occur here is $(2d)^2 k$. (Note that this is saying that if we condition on any event of the non-neighbors of $v$, the probability of $A^+_{v,c}$ is unchanged.) Now by the Lovász Local Lemma, we can check that $ep(D + 1) \le 1$ with $p = \frac{1}{100d^3}$ and $D = (2d)^2 k$, and we're done.


Now let’s move on to an open problem: 


Definition 6.27 


A linear forest is a disjoint union of paths. 


Conjecture 6.28 (Linear arboricity) 


The edge-set of every graph with maximum degree $\Delta$ can be decomposed into $\left\lceil \frac{\Delta + 1}{2} \right\rceil$ linear forests.

We can't really do any better than this for any given graph: any path only contributes 2 to the degree, so we need at least $\frac{\Delta}{2}$ forests. Notice that every graph of maximal degree $\Delta$ is a subgraph of a $\Delta$-regular graph, so it suffices to consider $\Delta$-regular graphs.

For a long time, there was a constant factor gap between the best-known bounds, until Alon showed in 1988 that $\frac{\Delta}{2} + o(\Delta)$ linear forests will work. Alon and Spencer found in 1992 that we can have $\frac{\Delta}{2} + \tilde{O}(\Delta^{2/3})$ (where the tilde means there are some other log terms that are neglected). Finally, last year, Ferber, Fox, and Jain proved $\frac{\Delta}{2} + O(\Delta^{2/3 - \alpha})$ for some constant $\alpha > 0$.


There's also a directed version of this conjecture:

Conjecture 6.29 (Directed linear arboricity or DLAC)

The edge-set of every $\Delta$-regular directed graph can be decomposed into $\Delta + 1$ directed linear forests.

This implies the undirected version: take $G$ to be a $2d$-regular graph, and now use an Eulerian tour to orient and get a $d$-regular digraph. If $G$ starts as an odd-degree regular graph, we can just modify it a bit.

Also, we need at least $\Delta + 1$ directed linear forests for the same reason: every linear forest can only contribute one to each vertex's indegree and outdegree, but we can't get 1 on everything (because of the endpoints of our paths).

Let's first outline some of the key steps before jumping into a somewhat weaker bound for DLAC. First, we show that the result is true if the girth of the graph is at least $8e\Delta$. The idea there is that (using Hall's theorem) we can decompose this graph into cycles. This gives $\Delta$ subgraphs, each of which is a disjoint union of cycles. To turn this into a covering by linear forests, we can just cut out an edge from each cycle, and our goal is to make sure the removed edges form a directed linear forest as well.

After we prove this version with large girth, we can divide a general graph into subgraphs with large girth. That's also something we did: we found how to produce cycles with length divisible by $k$, and if $k \ge 8e\Delta$, we get cycles of long length.


In our proof, we will use a few graph theory results: 


Lemma 6.30 (A consequence of Hall's theorem) 


A d-regular bipartite graph has a perfect matching. 


Notice that removing a perfect matching from a d-regular bipartite graph gives a (d — 1)-regular bipartite graph. 


Corollary 6.31 (König's theorem)

Every $d$-regular bipartite graph can be decomposed into $d$ matchings.

We can then use König's theorem to prove the following lemma:


Lemma 6.32 


The edge-set of every $\Delta$-regular digraph can be decomposed into $\Delta$ 1-regular spanning subgraphs.

For a directed graph, a 1-regular spanning subgraph is a collection of directed cycles using all vertices exactly once.

Proof. Construct two copies of the original vertex set. For each edge going from $v_i$ to $v_j$, draw an edge from $v_i$ in the first copy to $v_j$ in the second copy: this gives a $\Delta$-regular bipartite graph. Applying König, we get $\Delta$ perfect matchings: collapse each matching back to the original graph, and it becomes a 1-regular spanning subgraph (think of $v_i$ going to $v_j$ as a permutation of the vertices), as desired.
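The proof is constructive, and a direct translation might look like the sketch below. It repeatedly extracts a perfect matching of the bipartite double cover using the standard augmenting-path method; the function name `one_regular_factors`, the edge-list input format, and the circulant example are assumptions made for illustration.

```python
def one_regular_factors(n, arcs):
    """Split the arcs of a regular digraph on vertices 0..n-1 into permutations.

    arcs: list of (u, v) pairs, each meaning an arc u -> v.
    Returns a list of lists perm with perm[u] = v; every arc is used exactly once.
    """
    remaining = [[] for _ in range(n)]
    for u, v in arcs:
        remaining[u].append(v)
    factors = []
    while any(remaining):
        # match_of[v] = u means: right copy of v is matched to left copy of u.
        match_of = [None] * n

        def augment(u, seen):
            # Try to match left vertex u, rerouting earlier matches if needed.
            for v in remaining[u]:
                if v in seen:
                    continue
                seen.add(v)
                if match_of[v] is None or augment(match_of[v], seen):
                    match_of[v] = u
                    return True
            return False

        for u in range(n):
            if not augment(u, set()):
                raise ValueError("no perfect matching: the digraph is not regular")
        # Collapse the matching back to the digraph and remove its arcs.
        perm = [None] * n
        for v, u in enumerate(match_of):
            perm[u] = v
            remaining[u].remove(v)
        factors.append(perm)
    return factors

# Hypothetical 2-regular example: arcs i -> i+1 and i -> i+2 modulo 5.
arcs = [(i, (i + 1) % 5) for i in range(5)] + [(i, (i + 2) % 5) for i in range(5)]
factors = one_regular_factors(5, arcs)
```

Removing a perfect matching from a regular bipartite graph leaves a regular bipartite graph of one lower degree, which is why the loop can keep extracting matchings until no arcs remain.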


Lemma 6.33 (An easy bound on the DLAC) 


The edge-set of every directed $\Delta$-regular graph can be decomposed into at most $2\Delta$ directed linear forests.

Proof. Use the lemma above to get $\Delta$ 1-regular spanning subgraphs: for each of these, split each cycle into two paths arbitrarily, so that each 1-regular spanning subgraph becomes 2 directed linear forests.

Theorem 6.34 (DLAC for large directed girth)

The directed linear arboricity conjecture is true if the directed girth is at least $8e\Delta$.

Proof. We can decompose the edge-set into $\Delta$ 1-regular spanning subgraphs $F_1, F_2, \dots, F_\Delta$, and now each $F_i$ is a disjoint union of cycles. Our goal is to show that we can break up the cycles in a nice way.

Consider the line graph of $G$, $L(G)$, where the vertices are the edges of $G$ and $e_1 \sim e_2$ is drawn if $e_1$ and $e_2$ are incident (share a vertex). Our goal is to select an independent set from this line graph that hits every cycle in the $F_i$s above, because this would allow us to get $\Delta + 1$ directed linear forests.

By Theorem 6.21, we know there exists a matching $M \subseteq E(G)$ that contains an edge from each cycle: the cycles partition the vertices of the line graph into parts of size at least $8e\Delta$, and each vertex in our line graph has degree at most $4\Delta$, so the parts have size at least $2e \cdot 4\Delta$. Take $F_1 \setminus M, F_2 \setminus M, \dots$: we've now broken up the cycles, so each of those is a linear forest. Thus $M$ plus these gives the $\Delta + 1$ directed linear forests, as we want.


Now we're ready to prove the main result of this section: 


Theorem 6.35 (A better bound for DLAC) 


Every $\Delta$-regular directed graph can be decomposed into at most $\Delta + O(\Delta^{3/4})$ (possibly with extra poly-log factors) linear forests.

Our goal is to produce a lot of cycles with long length. One thing we can do, as last time, is to label our vertices mod $k$, and make sure all edges go from vertices labeled $i$ to vertices labeled $i + 1$.

Proof. Pick a prime $k$ in the range $[10\sqrt{\Delta}, 20\sqrt{\Delta}]$ (this exists). By Lemma 6.26, there exists a coloring of the vertex set by elements of $\mathbb{Z}/k\mathbb{Z}$, so that every vertex has $\frac{\Delta}{k} + O\left(\sqrt{\frac{\Delta \log \Delta}{k}}\right)$ in-neighbors and out-neighbors of each color.

For each $i$, let $D_i$ be the subgraph of edges whose color increases by $i$ mod $k$ from the start to the end point. We know that $\Delta_i$, the maximum degree of $D_i$, is at most $\frac{\Delta}{k} + O\left(\sqrt{\frac{\Delta \log \Delta}{k}}\right)$. For all $i \ne 0$, $D_i$ has girth at least $k \ge 8e\Delta_i$ (this is where we need $k$ to be on the order of $\sqrt{\Delta}$), so the DLAC for large girth tells us that we can decompose each $D_i$ for nonzero $i$ using $\Delta_i + 1$ linear forests. For $D_0$, use the easy bound of twice the degree from Lemma 6.33. This means the total number of linear forests is

$$\le \left(\frac{\Delta}{k} + O\left(\sqrt{\frac{\Delta \log \Delta}{k}}\right) + 1\right)(k + 1) \le \Delta + O(\Delta^{3/4})$$

up to poly-log factors, as desired.

Notice that we needed $k$ to be prime to ensure that all cycles go through all $k$ colors.


How can we improve this? Notice that we can decompose $K_4$ into Hamiltonian paths: in general, we can decompose any $K_{2n}$. If we have $k = 2n$ colors in our proof above, consider all the edges between the color groups: decompose into $\Delta_i$ matchings for each “connection” between color groups, and this means we can decompose into $\Delta_i$ paths, minus some edges. Applying the easy bound to the edges within the color groups again, this gives $k$ Hamiltonian paths, and our bound is now

$$\left(\frac{\Delta}{k} + O\left(\sqrt{\frac{\Delta \log \Delta}{k}}\right)\right) k + O\left(\frac{\Delta}{k}\right),$$

and optimizing for the correct value of $k$ (around $\Delta^{1/3}$), we minimize this at

$$\Delta + O(\Delta^{2/3})$$

up to poly-log factors.


This was nice because we can pick a smaller value of k, and we don’t have the girth requirement anymore! 


6.9 The lopsided local lemma 


Remember in the proof of the local lemma, we made a bound of the form

$$\text{numerator} \le \Pr\left(A_i \,\middle|\, \bigwedge_{j \in S_2} \overline{A_j}\right),$$

where $S_2$ consists of elements outside of $N(i)$ (not connected to $i$). By the definition of the dependency graph, this is just equal to $\Pr(A_i) \le x_i \prod_{j \in N(i)} (1 - x_j)$. This was the only place we used the independence hypothesis!

What if we instead just assumed an inequality in place of that equality? In other words, what if conditioning on other bad events not occurring only makes $A_i$ less likely to occur?

Here's the formal setup. Let's say we have some bad events $A_1, \dots, A_n$. The negative dependence digraph is one where (if $N(i)$ is the set of out-neighbors of $i$), we have

$$\Pr\left(A_i \,\middle|\, \bigwedge_{j \in S} \overline{A_j}\right) \le \Pr(A_i) \quad \text{for all } i \text{ and all } S \subseteq [n] \setminus (N(i) \cup \{i\}).$$

So this graph records potential negative dependence: $N(i)$ notes the bad events that can potentially be worrying for us (since the events are more likely to occur separately), and everything else either does nothing or actually helps us (because it means the events have positive dependence). This is called the lopsidependency graph.


Theorem 6.36 (Lopsided Lovasz Local Lemma) 


If there exist real numbers $x_1, \dots, x_n \in [0, 1)$ such that

$$\Pr(A_i) \le x_i \prod_{j \in N(i)} (1 - x_j) \quad \text{for all } i \in [n],$$

then with probability at least $\prod_{i=1}^n (1 - x_i)$, no event $A_i$ occurs.

The proof is exactly the same: just change the one equality in the numerator bound to an inequality. In fact, in the definition of the negative dependence digraph, we can replace the $\le \Pr(A_i)$ condition with $\le x_i \prod_{j \in N(i)} (1 - x_j)$.

There's also a symmetric version, which is easier to use: if the negative dependency (di)graph has all neighborhoods of size at most $d$, and $ep(d + 1) \le 1$, then with positive probability, no bad event occurs.


We showed using LLL that every $k$-uniform hypergraph is 2-colorable if every edge intersects at most $\frac{2^{k-1}}{e} - 1$ other edges: color all the vertices uniformly at random. What if we refine that argument a bit?

Proposition 6.37

Every $k$-uniform hypergraph is 2-colorable if every edge intersects at most $\frac{2^k}{e} - 2$ other edges.


Solution. For each edge $f$, we previously defined the bad event $A_f$ that $f$ is monochromatic, but now, let's let $A_{f,c}$ be the event that all vertices in $f$ are colored with color $c$. It is clear that $\Pr(A_{f,c}) = 2^{-k}$.

Let our colors be 0 and 1. Consider the graph where $A_{f,c}$ and $A_{f',1-c}$ are adjacent if $f$ and $f'$ intersect. Edges of our dependency graph then come from two intersecting edges with opposite colors: we claim this is a negative dependency graph. The intuition is that compared to the vertex overlap graph, this one tells us a little more: if we have an edge $f$, and we want to know the probability of $f$ being colored all red, it's more likely if we know $f'$ is not colored all blue, but it is only less likely if we know $f'$ is not colored all red.

So if every edge intersects at most $d$ other edges, then the degree of the negative dependency graph is at most $d + 1$, and $ep(d + 2) \le 1$ as long as $d \le \frac{2^k}{e} - 2$. Applying the Lopsided Lovász Local Lemma yields the result.


6.10 Latin squares 


Let’s now do an example where the lopsidedness matters: 


Definition 6.38 


A Latin square of order n is an n x n square filled with n symbols (usually 1 through n), such that every symbol 


appears exactly once in each row and column. 
A transversal of an $n \times n$ array is a set of $n$ entries with one per row and one per column. A Latin transversal is a transversal whose entries are all distinct.


Conjecture 6.39 


All odd order Latin squares have a Latin transversal. 


This is still open, so we're not going to prove it. Instead, we'll show the following weakening of that result: 


Theorem 6.40 (Erdős, Spencer)

Every $n \times n$ array where every entry appears at most $\frac{n}{4e}$ times has a Latin transversal.


There's an open problem that every odd-n Latin square has a Latin transversal: this is a looser restriction of that. 
(It’s not necessarily true for even n, but the conjecture is that we can always find (n—1) different entries in that case 


in a “rook placement.”) 


Remark (Historical note). These objects are called Latin squares, because Euler started playing with them and wrote 


a paper where they had Latin entries instead of numbers. 


It's not really clear what the bad events can be here, but the point is that a random permutation involves every row and every column at once, so there aren't many events supported on disjoint sets of variables. That makes it hard to apply the vanilla LLL. But some of the dependencies only help us, so we can apply the lopsided version.


Proof. Pick a transversal uniformly at random: this is equivalent to picking one of $n!$ permutations uniformly at random. Then we can have our bad events be of the form $A_{ijkl}$, where $(i,j)$ and $(k,l)$ are both picked, and they have the same entry written in them. (Note that the second part of this condition is not random: it is already determined by our initial array.) If we avoid all such bad events, we get a Latin transversal.

The probability of $A_{ijkl}$ is 0 if the two squares $(i,j)$ and $(k,l)$ are in the same row or column, since we can't pick them both in our transversal. For all others, the probability is $\frac{(n-2)!}{n!} = \frac{1}{n(n-1)}$, since there are $(n-2)!$ permutations that go through these two points.

If we try to build an ordinary dependency graph as in the Lovász Local Lemma, we run into a problem: most bad events are dependent on each other. But what about the negative dependency graph?


Let $G$ be the graph $(V, E)$, where

- $V$ consists of unordered pairs $\{(i,j), (k,l)\}$ that have the same entry and aren't in the same row or column.
- $\{(i,j), (k,l)\}$ and $\{(i',j'), (k',l')\}$ are connected by an edge if and only if there are any rows or columns that are shared: $\{i,k\} \cap \{i',k'\} \ne \emptyset$ or $\{j,l\} \cap \{j',l'\} \ne \emptyset$.

If we can show this is a negative dependency graph, then we can finish in the following way: given any pair of squares, another pair is adjacent to it in the negative dependency graph if we pick one of the $(4n - 4)$ squares in the same row or column, and then pick one of the at most $\frac{n}{4e} - 1$ other squares with the same entry. Then the maximum degree is at most $d = (4n-4)\left(\frac{n}{4e} - 1\right) \le \frac{n(n-1)}{e} - 1$, and therefore $ep(d + 1) \le 1$ with $p = \frac{1}{n(n-1)}$, meaning we can apply the Lopsided Lovász Local Lemma.

So to show that we do have a negative dependency graph $G$, fix some $i, j, k, l$. Let $J$ be a subset of the non-neighbors of the bad event $\{(i,j), (k,l)\}$. Our goal is to show that

$$\Pr\left(A_{ijkl} \,\middle|\, \bigwedge_{i'j'k'l' \in J} \overline{A_{i'j'k'l'}}\right) \le \Pr(A_{ijkl}).$$

Without loss of generality, we can assume $\{(i,j), (k,l)\} = \{(1,1), (2,2)\}$ by permuting rows and columns. We need to show that the probability that the transversal goes through $(1,1)$ and $(2,2)$ is at least as large as it is when conditioned on avoiding all of the other events $A_{i'j'k'l'}$ in $J$.

So we're not conditioning on any information through the first two columns or rows. How many transversals are there satisfying these conditions? Set $S_{r,s}$ to be the set of transversals through $(1,r)$ and $(2,s)$ that avoid all bad events in $J$. Note that the $S_{r,s}$ over distinct $r, s \in [n]$ partition the event $\bigwedge_{i'j'k'l' \in J} \overline{A_{i'j'k'l'}}$ into $n(n-1)$ subsets. Our goal is to show that $|S_{1,2}| \le |S_{r,s}|$ for all $r \ne s$: if we show this, it would follow that

$$\Pr\left(A_{1122} \,\middle|\, \bigwedge_{i'j'k'l' \in J} \overline{A_{i'j'k'l'}}\right) = \frac{|S_{1,2}|}{\sum_{r \ne s} |S_{r,s}|} \le \frac{1}{n(n-1)}.$$

So we're saying that not having any bad events in rows 3 through $n$ only helps us in the $(1,2)$ case. To do this, we'll construct an injection $S_{1,2} \to S_{r,s}$ by swapping which columns are used: the rows that used columns $r$ and $s$ now use columns 1 and 2, while rows 1 and 2 now use columns $r$ and $s$ (with the obvious adjustment when $r$ or $s$ lies in $\{1, 2\}$). This can't cause any bad events in $J$ to occur, because all of those potential bad events live in the bottom-right $(n-2)$ by $(n-2)$ square, and the swap adds no new squares there! This finishes the proof.


Notice that the hardest part of the proof is finding the right bad events and dependency graph to work with: the rest is fairly straightforward.

7 Correlation and Janson's inequalities


7.1 The Harris-FKG inequality 


We're often interested in understanding probabilities related to a graph G(n, p): for example, we might want the 
probability of some event occurring, such as the probability of having a triangle (which we don’t know how to compute 
yet). 

But we can also ask questions like “does this probability increase or decrease if we condition on some other event?” 
For example, if we have a Hamiltonian cycle, does this increase or decrease our chance of having a triangle? 

These don’t seem all that relevant to each other, but let's look more closely. Having a Hamiltonian cycle is 
indicative of having more edges in the graph, and thus the probability of having a triangle should go up. This isn’t a 
proof, but the basic idea here is correct. Similarly, if we know that our graph is planar, we should expect fewer edges, 
so we should expect a smaller probability of having a triangle. We can rigorize this! 

Let's say we have $n$ independent Bernoulli variables $x_1, \dots, x_n$ (for most applications, the probabilities are the same). An event $A$ is increasing if changing some 0s to 1s never destroys the event. In other words, if $x \le x'$ pointwise, and $x \in A$, then $x' \in A$ as well. Notice that we can view $A$ as a subset of $\{0,1\}^n$: this is an up-set, because it's closed upwards. Similarly, decreasing events are down-sets.


Example 7.1 


Let's say we have a graph $G(N, p)$, and we have $n = \binom{N}{2}$ random variables for the edges. Having a Hamiltonian cycle and being connected are increasing, while having average degree at most 5, planarity, and being 4-colorable are decreasing.


Proposition 7.2 (Harris inequality) 


If $A$ and $B$ are increasing events of independent boolean variables, then

$$\Pr(A \cap B) \ge \Pr(A)\Pr(B).$$

We also have that $\Pr(A \mid B) \ge \Pr(A)$.

(Both of these say that $A$ and $B$ are positively correlated.) More generally, we can let each $\Omega_i$ be a probability space that is linearly ordered (for example, $\{0, 1\}$ in the case above).
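
To make the statement concrete, the following sketch checks the Harris inequality by brute force on $\{0,1\}^4$. The two events `event_a` and `event_b` and the value $p = 0.3$ are hypothetical examples chosen for illustration; any pair of up-sets would do.

```python
from itertools import product

def is_increasing(event, n):
    """Up-set check: flipping any single 0 to 1 must keep x inside the event."""
    for x in product((0, 1), repeat=n):
        if event(x):
            for i in range(n):
                if x[i] == 0 and not event(x[:i] + (1,) + x[i + 1:]):
                    return False
    return True

def prob(event, n, p):
    """Exact probability of an event when each bit is 1 independently with probability p."""
    total = 0.0
    for x in product((0, 1), repeat=n):
        if event(x):
            ones = sum(x)
            total += p ** ones * (1 - p) ** (n - ones)
    return total

def event_a(x):
    # Hypothetical increasing event: bits 0 and 1 are both set, or bit 2 is set.
    return bool((x[0] and x[1]) or x[2])

def event_b(x):
    # Hypothetical increasing event: at least two bits are set.
    return sum(x) >= 2

n, p = 4, 0.3
p_a = prob(event_a, n, p)
p_b = prob(event_b, n, p)
p_ab = prob(lambda x: event_a(x) and event_b(x), n, p)
print(p_ab, p_a * p_b)
```

Checking single-bit flips is enough for the up-set property, since any pointwise increase can be reached one coordinate at a time.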


Definition 7.3 
A function $f(x_1, \dots, x_n)$ is monotone increasing if given two vectors $x \le x'$ (pointwise in every coordinate), $f(x) \le f(x')$.


Theorem 7.4 (More general version of the Harris inequality) 


Let $f, g$ be increasing functions of independent random variables $x_1, \dots, x_n$. Then

$$\mathbb{E}[fg] \ge \mathbb{E}[f]\,\mathbb{E}[g].$$

(This implies the first version by picking $f$ and $g$ to be indicator functions.)

Later generalizations were made by “FKG” (Fortuin, Kasteleyn, Ginibre), but we won't discuss FKG inequalities in their full generality. The idea is that we can relax the independence condition, use a distributive lattice, and use a few other conditions.

Proof of the Harris inequality. We will use induction on the number of random variables. If $n = 1$, then we're saying that given two functions $f, g$ that are monotone increasing on one variable,

$$\mathbb{E}[fg] \ge \mathbb{E}[f]\,\mathbb{E}[g].$$

This is due to Chebyshev (the same guy as earlier in the course): the proof is that picking $x, y$ independently from the same distribution,

$$0 \le \mathbb{E}[(f(x) - f(y))(g(x) - g(y))]$$

since $f(x) - f(y)$ and $g(x) - g(y)$ have the same sign, and expanding this out,

$$= 2\mathbb{E}[fg] - 2\mathbb{E}[f]\,\mathbb{E}[g],$$

implying the result for one variable. Now for the inductive step, let $h = fg$. We'll fix $x_1$, defining a new function

$$f_1(x) = \mathbb{E}[f \mid x_1 = x]:$$

basically, fix one of the variables in our function $f$ and average over the rest. Likewise, let $g_1(x) = \mathbb{E}[g \mid x_1 = x]$ and similar for $h_1$. We know that $f_1, g_1$ are monotone increasing functions of $x$, and that for each fixed value of $x_1$, $f$ and $g$ are monotone increasing functions of the remaining $n - 1$ variables. Note that

$$h_1(x_1) \ge f_1(x_1)\, g_1(x_1)$$

by the induction hypothesis, so

$$\mathbb{E}[fg] = \mathbb{E}[h] = \mathbb{E}[h_1]$$

(letting $x_1$ be random as well now), and pointwise, this is

$$\ge \mathbb{E}[f_1 g_1] \ge \mathbb{E}[f_1]\,\mathbb{E}[g_1] = \mathbb{E}[f]\,\mathbb{E}[g]$$

by the base case, since $f_1$ and $g_1$ are one-variable functions.


Corollary 7.5 


Decreasing events are also positively correlated:

$$\Pr(A \cap B) \ge \Pr(A)\Pr(B).$$

(Take the complement of a decreasing event to get an increasing event.) Similarly, if one event is increasing and another is decreasing, they are negatively correlated. Finally, if all $A_i$s are increasing or all decreasing, we can say that

$$\Pr(A_1 \cdots A_k) \ge \Pr(A_1) \cdots \Pr(A_k).$$


7.2 Applications of correlation 


Example 7.6 


Let’s find the probability that G(n, p) is triangle-free. 


There are lots of possible appearances of triangles, and lots of dependent probabilities. But we know that these events are all correlated! In particular, let $A_{ijk}$ be the event that $(i, j, k)$ is not a triangle. $A_{ijk}$ is a decreasing event (on the edges), and this means the probability of having no triangles at all is (by Harris' inequality)

$$\Pr\Big(\bigwedge_{i,j,k} A_{ijk}\Big) \ge \prod_{i,j,k} \Pr(A_{ijk}) = (1 - p^3)^{\binom{n}{3}}.$$

How close is this to the truth? Taking $p = o(1)$, we can approximate this as

$$= e^{-(1+o(1))p^3n^3/6}.$$

(By the way, the probability $G(n, p)$ is triangle-free is monotone in $p$ by coupling: having a higher chance of including each edge just makes our chances worse.) One way to obtain an upper bound is by Janson's inequality, which is kind of dual to the Lovasz Local Lemma, but we'll see that in the next few sections.
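
For very small $n$ the triangle-free probability can be computed exactly, which gives a feel for how much slack the Harris bound $(1 - p^3)^{\binom{n}{3}}$ leaves. The sketch below enumerates all graphs on $n = 5$ vertices; the function names and the choice $p = 0.3$ are illustrative assumptions.

```python
from itertools import combinations, product
from math import comb

def triangle_free_probability(n, p):
    """Exact Pr[G(n, p) has no triangle], summing over all 2^C(n,2) graphs."""
    edges = list(combinations(range(n), 2))
    index = {e: i for i, e in enumerate(edges)}
    triangles = [(index[(a, b)], index[(a, c)], index[(b, c)])
                 for a, b, c in combinations(range(n), 3)]
    total = 0.0
    for present in product((0, 1), repeat=len(edges)):
        if any(present[x] and present[y] and present[z] for x, y, z in triangles):
            continue  # this graph contains a triangle
        m = sum(present)
        total += p ** m * (1 - p) ** (len(edges) - m)
    return total

def harris_lower_bound(n, p):
    """Product of the individual 'no triangle here' probabilities."""
    return (1 - p ** 3) ** comb(n, 3)

n, p = 5, 0.3
exact = triangle_free_probability(n, p)
bound = harris_lower_bound(n, p)
print(f"exact = {exact:.4f}, Harris lower bound = {bound:.4f}")
```

For $n = 3$ there is a single triangle, so the two quantities coincide.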


Example 7.7 


What is the probability that $G\left(n, \frac12\right)$ has maximum degree at most $\frac{n}{2}$?

Let $A_v$ be the event that the degree of $v$ is at most $\frac{n}{2}$: each of these has probability at least $\frac12$, so by the Harris inequality, the probability is at least the product of the individual vertex probabilities, which is at least $2^{-n}$.

Is this close to the truth - what's the actual value? It turns out the probability is indeed of the form $(c + o(1))^n$. Is $c = \frac12$, meaning that our correlation inequality is essentially tight, or is $c > \frac12$, which means our lower bound is not very good?

Theorem 7.8 (Riordan-Selby, 2000)

The probability that $G\left(n, \frac12\right)$ has maximum degree at most $\frac{n}{2}$ is $(c + o(1))^n$, where $c \approx 0.6102$.


This is very technical, but let's do a “physicist proof” to see where the number comes from. 


Solution motivation. Use a continuous model instead. Instead of making each variable Bernoulli, put a standard normal distribution $Z_{uv}$ on each (undirected) edge of $K_n$. Now the "degree" of a vertex is just the sum of the standard normals of the edges connected to it.

Let $W_v = \sum_{u \neq v} Z_{uv}$ be this sum: we know each $W_v$ has a $\frac12$ chance of being at most 0. What's the probability all $W_v$s are at most 0? We know the $W_v$s are jointly normal, so their distribution is entirely determined by their covariance matrix. The variance of $W_v$ is $n - 1$, and the covariance between $W_u$ and $W_v$ is 1 because of the shared edge. So now we can directly compute by using a different model with the same covariance matrix: this is identically distributed as

$$\sqrt{n-2}\,(Z'_1, \dots, Z'_n) + Z'_0\,(1, 1, \dots, 1)$$

(where each $Z'_i$ is a standard normal distribution independent of the others).

Then what is the probability that this vector has all entries less than or equal to 0? That's an explicit calculation: let $\Phi$ be the cumulative distribution function of the standard normal distribution. Conditioning on the value $-z$ of $Z'_0$ (whose density is symmetric), we find that the desired probability is

$$\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-z^2/2}\, \Phi\left(\frac{z}{\sqrt{n-2}}\right)^n dz,$$

where $dz$ refers to us picking $Z'_0$ first. To evaluate this, we can substitute $z = y\sqrt{n}$ for scaling, and this integral becomes

$$\sqrt{\frac{n}{2\pi}} \int_{-\infty}^{\infty} e^{-n f(y)}\, dy,$$

where $f(y) = \frac{y^2}{2} - \log \Phi\left(y\sqrt{\frac{n}{n-2}}\right)$. We can pretend $f(y) = \frac{y^2}{2} - \log \Phi(y)$ and bound the error, and then look at the asymptotic behavior of this integral as $n \to \infty$. Well, there's a general principle:

If $f$ is a “sufficiently nice” function with a unique minimum at $y_0$, then

$$\int_{-\infty}^{\infty} e^{-n f(y)}\, dy = \left(e^{-f(y_0)} + o(1)\right)^n.$$

Basically, as $n \to \infty$, we only get contributions from the smallest $f(y)$. So the rest is just finding the right value of $y_0$: we can just do this by taking the derivative, and that yields $c \approx 0.6102$ as desired.

The actual proof is very technical, but this is just a general idea to explain where the constant $c$ potentially comes from. A lot of probability theory now is rigorizing physical intuitions!
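
The last step of this heuristic is easy to reproduce numerically. The sketch below assumes the simplified exponent $f(y) = \frac{y^2}{2} - \log \Phi(y)$ from the argument above, locates the root of $f'$ by bisection (the bracket $[0, 1]$ is a choice made here, justified by $f'(0) < 0 < f'(1)$), and reports $e^{-f(y_0)}$.

```python
from math import erf, exp, log, pi, sqrt

def Phi(y):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(y / sqrt(2.0)))

def phi(y):
    """Standard normal density."""
    return exp(-y * y / 2.0) / sqrt(2.0 * pi)

def f(y):
    """Simplified exponent f(y) = y^2/2 - log Phi(y)."""
    return y * y / 2.0 - log(Phi(y))

def f_prime(y):
    """Derivative of f: y - phi(y)/Phi(y)."""
    return y - phi(y) / Phi(y)

def minimizer(lo=0.0, hi=1.0, steps=60):
    """Bisection for the root of f' on [lo, hi], where f' changes sign."""
    for _ in range(steps):
        mid = (lo + hi) / 2.0
        if f_prime(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

y0 = minimizer()
c = exp(-f(y0))
print(f"y0 = {y0:.4f}, c = {c:.4f}")
```

The printed constant should be compared with the value $c \approx 0.6102$ quoted in Theorem 7.8.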


7.3 The first Janson inequality: probability of non-existence 


The Harris-FKG inequality gives us lower bounds on the probabilities of certain events, but those are not necessarily 
tight bounds. We'll now start to explore some methods of obtaining upper bounds that are hopefully close to the 


Harris lower bounds. 


Setup 7.10 
Pick a random subset $R$ of $[N]$, where each element is chosen independently (usually with probability $\frac12$). Refer to $[N]$ as the “ground set.” Suppose we have some subsets $S_1, S_2, \dots, S_k \subseteq [N]$, and let $A_i$ be the “bad event” that $S_i \subseteq R$, for all $i$.

Denote $X = \sum_i 1_{A_i}$ to be the number of $A_i$s that occur. Note that

$$\mu = \mathbb{E}[X] = \sum_i \Pr(A_i),$$

and we have a dependency graph $i \sim j$ if $i \neq j$ and $S_i \cap S_j \neq \emptyset$ (the two underlying subsets overlap). Much like in our earlier covariance calculations, let

$$\Delta = \sum_{(i,j):\, i \sim j} \Pr(A_i \wedge A_j),$$

where the sum is over ordered pairs. ($\Delta$ bounds the covariance terms, so $\mu + \Delta$ is an upper bound on the variance.) For example, in the current problem we're considering, the $S_i$s are the triangles, and $[N]$ is the set of edges.

Back with the second moment method, we found that if the standard deviation was small relative to the mean, then we have concentration, so we want $\Delta$ to generally be small. Janson's inequalities are going to give us better control over our concentration!


Theorem 7.11 (First Janson inequality) 
With the definitions in Setup 7.10, the probability that no bad events occur is

$$\Pr(X = 0) \le e^{-\mu + \Delta/2}.$$

So if $\Delta$ is small relative to the mean, then we are essentially upper bounded by $e^{-\mu}$. By the way, this is pretty close to the truth: if all bad events occur with some probability $p = o(1)$, and $\Delta = o(\mu)$, then our lower bound from Harris is

$$\Pr(X = 0) \ge \prod_i \Pr(\overline{A_i}) = e^{-(1+o(1))\mu}$$

by using $1 - x = e^{-(1+o(1))x}$ as $x \to 0$. The original proof interpolated the derivative of the exponential generating function, but we'll look at a different one.


Proof by Boppana and Spencer, with a modification by Warnke. Let

$$r_i = \Pr(A_i \mid \overline{A_1}\, \overline{A_2} \cdots \overline{A_{i-1}})$$

(so this is conditioned on the event that none of the previous bad events occur). Then the probability that no bad events occur is a chain of conditional probabilities:

$$\Pr(X = 0) = \Pr(\overline{A_1}) \Pr(\overline{A_2} \mid \overline{A_1}) \cdots \Pr(\overline{A_k} \mid \overline{A_1} \cdots \overline{A_{k-1}}) = (1 - r_1)(1 - r_2) \cdots (1 - r_k) \le e^{-r_1 - \cdots - r_k}.$$

It now suffices to show that for all $i \in [k]$, we have

$$r_i \ge \Pr(A_i) - \sum_{j < i,\ j \sim i} \Pr(A_i \wedge A_j),$$

where the sum only accounts for those $A_j$ with $j < i$ that are dependent on $A_i$. Then we'd be done, since we have

$$\Pr(X = 0) \le e^{-\sum_i \Pr(A_i) + \sum_i \sum_{j < i,\ j \sim i} \Pr(A_i \wedge A_j)} = e^{-\mu + \Delta/2}$$

as desired. Well, the proof of that statement is somewhat reminiscent of the Lovasz Local Lemma proof.

Fix $i$, and split up the earlier events into those that depend and don't depend on $A_i$: let $D_0$ be $\bigwedge_{j < i,\ j \not\sim i} \overline{A_j}$ and $D_1$ be $\bigwedge_{j < i,\ j \sim i} \overline{A_j}$. This partitions all events $A_j$ with $j < i$, and now

$$r_i = \Pr(A_i \mid D_0 D_1) = \frac{\Pr(A_i D_0 D_1)}{\Pr(D_0 D_1)} \ge \frac{\Pr(A_i D_0 D_1)}{\Pr(D_0)}$$

by Bayes' formula. We're trying to use the fact that $D_0$ is independent of $A_i$ here, so by Bayes' rule again, this is equal to

$$= \Pr(A_i D_1 \mid D_0).$$

We can write this as

$$\Pr(A_i \mid D_0) - \Pr(A_i \overline{D_1} \mid D_0).$$

By independence of $A_i$ and $D_0$, this first term is now $\Pr(A_i)$. Now because $A_i$ is increasing, $\overline{D_1}$ is increasing, and $D_0$ is decreasing, we can use the Harris-FKG inequality: conditioning on a decreasing event must make the probability of an increasing event go down, so

$$\Pr(A_i \overline{D_1} \mid D_0) \le \Pr(A_i \overline{D_1}) = \Pr\Big(A_i \wedge \bigvee_{j < i,\ j \sim i} A_j\Big)$$

and by a union bound, this is at most

$$\le \sum_{j < i,\ j \sim i} \Pr(A_i \wedge A_j),$$

as desired.

Fact 7.12

Janson's inequality is originally about random subsets including a certain $S_i$, but we actually only need that the $A_i$s are increasing events. So they don't necessarily have to be of the form $S_i \subseteq R$ for our random subset $R$.


Previously, our dependency graph had edges wherever the S;s overlapped. In general, there’s a difference between 


pairwise independence and mutual independence, so it seems that we have to be careful. However, we get lucky: 


Lemma 7.13 


If $A, B, C$ are increasing events, and $A$ is independent of $B$ and of $C$, then $A$ is independent of $B \wedge C$.

So the pairwise dependence graph is the same as the dependency graph. This may not be that intuitive: remember that one counterexample was $R \subseteq [3]$, where $A_1$ is the event “$|R \cap \{1,2\}|$ is even,” $A_2$ is the event “$|R \cap \{1,3\}|$ is even,” and $A_3$ is the event “$|R \cap \{2,3\}|$ is even:” we have pairwise independence but not mutual independence. (Those parity events are not increasing, so there is no contradiction with the lemma.)

Proof. Note that $\Pr(A \wedge (B \wedge C)) + \Pr(A \wedge (B \vee C)) = \Pr(A \wedge B) + \Pr(A \wedge C) = \Pr(A)(\Pr(B) + \Pr(C))$. But on the other hand, by Harris-FKG, since the events are increasing,

$$\Pr(A \wedge (B \wedge C)) \ge \Pr(A)\Pr(B \wedge C), \qquad \Pr(A \wedge (B \vee C)) \ge \Pr(A)\Pr(B \vee C).$$

But the right-hand sides of these last two inequalities add up to $\Pr(A)(\Pr(B) + \Pr(C))$, matching the first equality! So equality must occur in both, and $A$ is independent of $B \wedge C$ and $B \vee C$.


Let's use the first Janson inequality to get an upper bound on the probability that $G(n, p)$ is triangle-free. Recall the Second Moment Method calculations that we made earlier: the expected number of triangles is

$$\mu = \binom{n}{3} p^3 \asymp n^3 p^3.$$

Meanwhile, $\Delta$ is counting the number of pairs of triangles with a shared edge: these look like 4-cycles with a diagonal, and that evaluates out to

$$\Delta \asymp n^4 p^5.$$

So $\Delta \ll \mu \iff p \ll n^{-1/2}$, and therefore the probability that $G(n, p)$ is triangle-free is $e^{-(1+o(1))\mu}$ if $p \ll n^{-1/2}$. Note that this is exactly the right asymptotic behavior (see the logic after Theorem 7.11).
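
To see the quantities $\mu$ and $\Delta$ concretely, the sketch below computes them for triangles in $G(n, p)$ by enumerating pairs of triangles directly, and then evaluates the first Janson upper bound next to the Harris lower bound. It assumes $\Delta$ is a sum over ordered pairs of distinct triangles sharing an edge; the function name and the parameters $n = 8$, $p = 0.1$ are illustrative.

```python
from itertools import combinations
from math import comb, exp

def janson_parameters(n, p):
    """Return (mu, Delta) for triangles in G(n, p).

    Delta sums Pr(A_i and A_j) over ordered pairs of distinct triangles
    that share at least one edge; such a pair spans 5 edges in total."""
    triangles = [frozenset(combinations(t, 2)) for t in combinations(range(n), 3)]
    mu = len(triangles) * p ** 3
    delta = 0.0
    for s in triangles:
        for t in triangles:
            if s != t and s & t:
                delta += p ** len(s | t)
    return mu, delta

n, p = 8, 0.1
mu, delta = janson_parameters(n, p)
janson_upper = exp(-mu + delta / 2)         # first Janson inequality
harris_lower = (1 - p ** 3) ** comb(n, 3)   # Harris inequality
print(f"mu = {mu:.4f}, Delta = {delta:.4f}")
print(f"{harris_lower:.4f} <= Pr(triangle-free) <= {janson_upper:.4f}")
```

Each triangle has 3 edges, and each edge lies in $n - 3$ other triangles, so the enumeration should agree with the closed form $\Delta = 3(n-3)\binom{n}{3}p^5$.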


Well, what if $p$ is larger - does the formula still hold when $p \gtrsim n^{-1/2}$, and is Harris a good approximation? Bipartite graphs are triangle-free, so the probability of a triangle-free $G(n, p)$ is at least the probability that it is bipartite. This is at least the probability that $G(n, p)$ has no edges, which is $(1 - p)^{\binom{n}{2}} = e^{-(1+o(1))pn^2/2}$ for $p \ll 1$. This is actually much larger than $e^{-\mu}$ if $p \gg n^{-1/2}$! So the lower bound of Harris is true, but it's inferior to very stupid bounds.


7.4 The second Janson inequality 


Let's try to strengthen our inequality when $\Delta \ge \mu$.


Theorem 7.14 (Second Janson Inequality) 
Again using the assumptions of Setup 7.10, if $\Delta \ge \mu$, then

$$\Pr(X = 0) \le e^{-\mu^2/(2\Delta)}.$$

Proof. For each subset of the bad events $T \subseteq [k]$, let

$$X_T = \sum_{i \in T} 1_{A_i}$$

be the number of bad events in $T$ only, and let $\mu_T = \sum_{i \in T} \Pr(A_i)$ and $\Delta_T = \sum_{(i,j) \in T^2,\ i \sim j} \Pr(A_i \wedge A_j)$ be defined similarly. Then the probability that none of the bad events occur is always

$$\Pr(X = 0) \le \Pr(X_T = 0) \le e^{-\mu_T + \Delta_T/2}.$$

Choose $T$ randomly: include each element independently with some probability $q$ (to be determined). Then $\mu_T$ has expectation $q\mu$, and $\Delta_T$ has expectation $q^2\Delta$ (since both $A_i$ and $A_j$ need to be kept for any term to count). Thus,

$$\mathbb{E}\left[-\mu_T + \frac{\Delta_T}{2}\right] = -q\mu + \frac{q^2\Delta}{2}.$$

Minimizing this, pick $q = \frac{\mu}{\Delta}$, which is at most 1 by the theorem statement. This yields $-\frac{\mu^2}{2\Delta}$, so there exists some choice of $T$ so that $-\mu_T + \frac{\Delta_T}{2} \le -\frac{\mu^2}{2\Delta}$, as desired.

These two Janson inequalities work in different regimes, and it's interesting that the proof of the second uses the proof of the first!
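
For completeness, the choice of $q$ in the proof of the second Janson inequality is a one-variable quadratic minimization. Written out (the name $\psi$ is introduced here only for the exponent $-q\mu + q^2\Delta/2$):

```latex
\begin{align*}
\psi(q) &= -q\mu + \frac{q^2 \Delta}{2},
  & \psi'(q) &= -\mu + q\Delta = 0 \iff q = \frac{\mu}{\Delta}, \\
\psi\!\left(\frac{\mu}{\Delta}\right)
  &= -\frac{\mu^2}{\Delta} + \frac{\mu^2}{2\Delta}
   = -\frac{\mu^2}{2\Delta}.
\end{align*}
```

The hypothesis $\Delta \ge \mu$ is exactly what makes $q = \mu/\Delta$ a valid probability.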


Remark. This “bootstrapping argument,” where we start with a weak inequality and make it stronger, is reminiscent of the crossing number inequality. We had

$$\operatorname{cr}(G) \ge |E| - 3|V|,$$

and this was only quadratic in $n$. To get a stronger result, we sampled our graph $G$, which gave us a much stronger inequality of the form $\operatorname{cr}(G) \gtrsim \frac{|E|^3}{|V|^2}$.

How do these compare to the second moment method calculations? There, we said that if $\Delta \ll \mu^2$ and $\mu \to \infty$, then $X$ is concentrated around its mean, meaning $\Pr(X = 0) = o(1)$. But here, we have an explicit exponential decay, rather than just knowing that the probability goes to zero.

Does this give a better bound than the first Janson inequality for $G(n, p)$ being triangle-free (when $p$ is large)? Say that $p \gg n^{-1/2}$, so that we have $\Delta \ge \mu$. By Janson's second inequality, the probability that $G(n, p)$ is triangle-free is now

$$\le e^{-\mu^2/(2\Delta)} = e^{-\Theta(n^2 p)}.$$

The exponent matches the order of “probability $G(n, p)$ has no edges” from above, which means this is essentially tight! So

$$\Pr(G(n, p) \text{ is triangle-free}) = \begin{cases} e^{-(1+o(1))n^3p^3/6} & p \ll n^{-1/2} \\ e^{-\Theta(n^2 p)} & p \gtrsim n^{-1/2}. \end{cases}$$

What is the constant here in the $\Theta$? We can do a bit better than “$G(n, p)$ has no edges,” since $G(n, p)$ has probability of being bipartite at least

$$(1 - p)^{2\binom{n/2}{2}} = e^{-(1+o(1))n^2p/4}.$$

Is this the dominating way of generating graphs with no triangles? The answer turns out to be yes, but we don't yet have the tools to show that. The modern way to think about this is through something called the container method.

Fact 7.15

A lot of probability distributions that are nonnegative-integer-valued are Poisson or Gaussian. In particular, for the $p \ll n^{-1/2}$ case, if the exponent converges to a constant, we get Poisson behavior for the number of triangles in $G(n, p)$.


7.5 Lower tails: the third Janson inequality 


One more time, let $X$ be the number of triangles in $G(n, p)$. This time, we want to estimate the probability that $X$ is at most $0.9\,\mathbb{E}[X]$ (or some other constant times the mean): this is a generalization of estimating the probability that $X = 0$.

This is on a larger order than the standard deviation of $X$, so Chebyshev-like tools don't help us. It turns out that in these cases, we have exponential decay:

Theorem 7.16 (Third Janson inequality)
Use the assumptions in Setup 7.10 again. For any $0 \le t \le \mu$,

$$\Pr(X \le \mu - t) \le \exp\left(\frac{-t^2}{2(\mu + \Delta)}\right).$$

We'll come back to the proof of this later - interestingly, it also bootstraps the first Janson inequality. First, let's look at a consequence of this by looking some more at triangle counts (here, triangle can also be replaced with any subgraph $H$). If we still let $X$ be the number of triangles in $G(n, p)$, and we let $t = c\mu \asymp n^3 p^3$ for some constant $c$, then

$$\Pr(X \le (1-c)\mathbb{E}[X]) \le \exp\left(-\Theta\left(\frac{n^6 p^6}{n^3 p^3 + n^4 p^5}\right)\right).$$

We can clean this up a bit by splitting by dominant term:

$$\Pr(X \le (1-c)\mathbb{E}[X]) \le \begin{cases} \exp(-\Theta(n^3 p^3)) & p \lesssim n^{-1/2} \\ \exp(-\Theta(n^2 p)) & p \gtrsim n^{-1/2}. \end{cases}$$


Are these inequalities tight, that is, do we have a corresponding lower bound? Turns out the answer is yes! The probability of having at most $(1-c)\mathbb{E}[X]$ triangles includes the probability that there are exactly 0 triangles, and the upper bounds that we just found have exponents on the same order as what we found for the 0-triangle case. This means our bounds are tight up to constant factors in the exponent.

Unfortunately, we don't actually know the value of the constants except for some values of $c$. We are essentially asking “what is the best way of getting few triangles?”, and one good way to do this is to uniformly decrease the probability of edges everywhere, which helps for small $c$. On the other hand, when $c$ is close to 1, we expect that bipartite graphs dominate the few-triangle space. However, it's still a research problem to look at values of $c$ in between and find the dominating graphs.

By the way, there's a reason we're only mentioning the lower tail: the upper tail is completely different. If we ask instead about $X \ge (1+c)\mathbb{E}[X]$, the analogous inequalities are false!


Proof of Theorem 7.16 by Warnke. Define the parameter $q \in [0, 1]$ (value to be determined later). Let $T \subseteq [k]$, where each element is included with probability $q$ independently; let's consider

$$X_T = \sum_{i \in T} 1_{A_i}.$$

We can alternatively write this as a sum over the original bad events:

$$X_T = \sum_{i \in [k]} 1_{A_i} W_i,$$

where each $W_i$ is an independent Bernoulli variable: $W_i = 1$ if $i \in T$ (which happens with probability $q$) and $W_i = 0$ otherwise. Notice that conditioning on $X$, our actual number of bad events, tells us

$$\Pr(X_T = 0 \mid X) = (1 - q)^X,$$

because this is the probability that none of the bad events that occurred are included in $T$. Taking expectations on both sides,

$$\mathbb{E}[(1 - q)^X] = \Pr(X_T = 0) \le e^{-\mu' + \Delta'/2}$$

by the first Janson inequality, where $\mu' = \mathbb{E}[X_T] = q\mu$ and $\Delta'$ (similarly) is $q^2\Delta$. We can think of the left side as a moment generating function if we want to get exponential bounds!

So now by Markov's inequality,

$$\Pr(X \le \mu - t) = \Pr\left((1-q)^X \ge (1-q)^{\mu - t}\right) \le (1-q)^{-(\mu - t)}\, \mathbb{E}[(1-q)^X],$$

and plugging in our result,

$$\le (1-q)^{-(\mu - t)}\, e^{-q\mu + q^2\Delta/2}.$$

Now we just optimize for $q$. One thing we can do is take the derivative and set things equal to 0, but instead, we can give ourselves some slack: let's let $1 - q = e^{-\lambda}$ for some $\lambda \ge 0$. By the Taylor expansion, $\lambda - \frac{\lambda^2}{2} \le q \le \lambda$. Plugging this in,

$$\Pr(X \le \mu - t) \le \exp\left(\lambda(\mu - t) - \left(\lambda - \frac{\lambda^2}{2}\right)\mu + \frac{\lambda^2}{2}\Delta\right) = \exp\left(-\lambda t + \frac{\lambda^2}{2}(\mu + \Delta)\right),$$

and now set $\lambda = \frac{t}{\mu + \Delta}$ to get the result.


Notice that the proof of Janson's third inequality only works for lower tails: in particular, the proof starts by using the probability that $X = 0$ and builds up from there. In fact, upper tails are completely different! Let's show that for $p$ in a suitable range,

$$\Pr(X \ge (1+c)\mathbb{E}[X]) \ge e^{-O(n^2 p^2 \log(1/p))}.$$

To do this, we ask: what's the “cheapest” possible way to generate lots of triangles? If we plant a clique, we get something pretty good:

$$\Pr(X \ge 2\mathbb{E}[X]) \ge \Pr(G(n, p) \text{ has a clique on the first } 10np \text{ vertices}),$$

because this already gives $\binom{10np}{3}$ triangles, which is more than $2\binom{n}{3}p^3$. Then the probability $G(n, p)$ has a clique in the first $10np$ vertices is

$$p^{\binom{10np}{2}} \ge e^{-O(n^2 p^2 \log(1/p))},$$

and this is exponentially larger than the previous bound for lower tails.
So what's the truth here? There was a paper written about the “infamous upper tail:” it just demonstrated that a wide range of techniques did not work for showing bounds on the upper tail. About 10 years ago, though, the following result (on the same order as the planted clique) was proved:

$$\Pr(X \ge 2\mathbb{E}[X]) = e^{-\Theta(n^2 p^2 \log(1/p))} \quad \text{if } p \gtrsim \frac{\log n}{n}.$$

The constant in the $\Theta$ here is a bit tricky, too. Basically, there are two main constructions for generating lots of triangles: have a hub of size $cnp^2$ and connect them to everything else, or have a clique of some select size. But this isn't obvious, and there's lots of open research here!


7.6 Revisiting clique numbers 


Recall that in our Second Moment Method discussions, we found that

$$\omega\left(G\left(n, \tfrac12\right)\right) \sim 2\log_2 n$$

with high probability (this is Theorem 4.21). We actually found two-point concentration: let's review the ideas of the proof so that we can look at them more closely.

Proof. Let $X_k$ be the number of $k$-cliques in $G\left(n, \frac12\right)$: the expected value of $X_k$ is then

$$\mu_k = \binom{n}{k} 2^{-\binom{k}{2}}.$$

If $k$ is a function of $n$, and our mean $\mu_k \to 0$, then $X_k = 0$ with high probability by Markov's inequality. Meanwhile, if $\mu_k \to \infty$, then $\Delta \ll \mu_k^2$: the variance is much smaller than the squared mean, so by the second moment method, there is a $k$-clique with high probability due to concentration.

Where does the quantity $\mu_k$ cross a threshold? We defined $k_0$ to be the largest $k$ such that $\mu_k \ge 1$. Another routine calculation shows us that around that value of $k \sim 2\log_2 n$,

$$\frac{\mu_{k+1}}{\mu_k} = n^{-1+o(1)},$$

and this implies that the clique number of $G\left(n, \frac12\right)$ is concentrated around $2\log_2 n$ with high probability: in fact, it's either $k_0$ or $k_0 - 1$ with high probability.
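
The threshold $k_0$ is easy to compute for concrete $n$. The sketch below assumes the definition used above ($k_0$ is the largest $k$ with $\binom{n}{k} 2^{-\binom{k}{2}} \ge 1$) and compares in exact integer arithmetic; the function names are illustrative.

```python
from math import comb, log2

def mean_at_least_one(n, k):
    """True when mu_k = C(n, k) * 2^(-C(k, 2)) >= 1, checked with exact integers."""
    return comb(n, k) >= 2 ** comb(k, 2)

def k0(n):
    """Largest k with mu_k >= 1.

    The ratio mu_{k+1}/mu_k = (n - k) / ((k + 1) * 2^k) is decreasing in k and
    mu_1 = n >= 1, so the set of k with mu_k >= 1 is an initial interval."""
    k = 1
    while mean_at_least_one(n, k + 1):
        k += 1
    return k

for n in (10**2, 10**3, 10**4, 10**6):
    print(n, k0(n), round(2 * log2(n), 2))
```

For moderate $n$ the value of $k_0$ sits noticeably below $2\log_2 n$, which is consistent with the asymptotic statement since the lower-order corrections decay slowly.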


Problem 7.17 


Is the clique number of $G\left(n, \frac12\right)$ really concentrated around two different points, or do we just have a weakness in our proof?


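Solution. Let's say that we have $k$ as a function of $n$ so that $\mu = \binom{n}{k} 2^{-\binom{k}{2}} \to c$ converges as $n, k \to \infty$. Doing the relevant calculations, we find that $\Delta = o(1)$, so $\Delta \ll \mu$ here. For all $S$ that are $k$-element subsets of $[n]$, let $A_S$ be the event that $S$ forms a clique: then the probability for any given $A_S$ is $2^{-\binom{k}{2}}$. Then the probability that $X = 0$, which is the probability that there are no $k$-cliques, is

$$\Pr\left(\omega\left(G\left(n, \tfrac12\right)\right) < k\right) \le e^{-\mu + \Delta/2}$$

by the first Janson inequality, and we also have $e^{-(1+o(1))\mu}$ as a lower bound by Harris. So because $\Delta \ll \mu$, the lower and upper bounds grow closer:

$$\Pr\left(\omega\left(G\left(n, \tfrac12\right)\right) < k\right) = e^{-(1+o(1))\mu} \to e^{-c},$$

which is some specific value between 0 and 1. To be more rigorous about this, we're saying that for $\lambda \in (-\infty, \infty)$, if $n_0(k)$ is the minimum $n$ such that $\binom{n}{k} 2^{-\binom{k}{2}} \ge 1$, then we can take $n = n_0(k)\left(1 + \frac{\lambda + o(1)}{k}\right)$, and thus the mean is

$$\mu = \binom{n}{k} 2^{-\binom{k}{2}} = e^{\lambda + o(1)}.$$

So that means that

$$\omega\left(G\left(n, \tfrac12\right)\right) = \begin{cases} k - 1 & \text{with probability } e^{-e^{\lambda}} + o(1) \\ k & \text{with probability } 1 - e^{-e^{\lambda}} + o(1), \end{cases}$$

and we can get sequences with two-point concentration of any probability we want.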


Fact 7.18 


On the other hand, most n do have one-point concentration, so both cases (one- and two-point concentration) do 


occur. One way to view this problem in general is that these situations look sort of Poisson in their distribution, 


and that’s the right kind of case to use Janson’s inequalities. 


7.7 Revisiting chromatic numbers


We're ready to go back to thinking about this result from earlier in the class now. Remember that in Corollary 4.23, we found that the independence number

$$\alpha\left(G\left(n, \tfrac12\right)\right) \sim 2\log_2 n$$

with high probability, and because each color class is an independent set,

$$\chi(G) \ge \frac{n}{\alpha(G)} = (1 + o(1))\frac{n}{2\log_2 n}$$

with high probability. This is a lower bound: the methods we had before didn't allow us to get an upper bound, but now we have the Janson inequalities.


Theorem 7.19
We have with high probability that

$$\chi\left(G\left(n, \tfrac12\right)\right) \sim \frac{n}{2\log_2 n}.$$


Proof. The idea is that we will use a “greedy coloring,” since our goal is to show that we can color all of $G$ using $(1 + o(1))\frac{n}{2\log_2 n}$ colors with high probability.

At each step of our strategy, we take out an independent set of size $(1 + o(1))2\log_2 n$. Each time we do that, color all of those vertices with one new color, and remove the independent set from our graph. We'll stop when we have $o\left(\frac{n}{\log n}\right)$ vertices left: at that point, we finish by just coloring with a different color for every vertex.

Why can we perform that step of removing an independent set of size $(1 + o(1))2\log_2 n$? It is sufficient to show that with “very” high probability, every “not-too-small” subset of $G$ has an independent set of size $(1 + o(1))2\log_2 n$: here “very high” means that the failure probability is exponentially small.


Lemma 7.20
There exists $k \sim 2\log_2 n$ such that

$$\Pr\left(\omega\left(G\left(n, \tfrac12\right)\right) < k\right) < e^{-n^{2-o(1)}}.$$

By extension, this is also true for the independence number of $G$.


Proof of lemma. Use the notation from the clique number calculations (specifically $k_0$). Let $k = k_0 - 3$ (so that we are looking for significantly smaller cliques): what's the probability that we don't have any $k$-cliques? This should be very small, since even $(k_0 - 1)$-cliques appear with high probability! In particular, the mean of the number of $k$-cliques is (because we change by a factor of $n^{1-o(1)}$ each time we change $k$ by one)

$$\mu_k \ge n^{3-o(1)}.$$

We can also calculate

$$\Delta \sim \mu_k^2\, \frac{k^4}{n^2},$$

so by the second Janson inequality,

$$\Pr\left(\omega\left(G\left(n, \tfrac12\right)\right) < k\right) \le e^{-\mu_k^2/(2\Delta)} = e^{-\left(\frac12 + o(1)\right)n^2/k^4} = e^{-n^{2-o(1)}},$$

as desired.

This decays very quickly. So now given a graph $G \sim G\left(n, \frac12\right)$, let's take $m = \left\lfloor \frac{n}{(\log n)^2} \right\rfloor$. For every set $S$ of $m$ vertices, let $G[S]$ be the graph induced by the vertices $S$. The probability the independence number of $G[S]$ is less than $k$ is at most

$$e^{-m^{2-o(1)}} = e^{-n^{2-o(1)}},$$

because $G[S]$ looks like $G\left(m, \frac12\right)$, for some $k \sim 2\log_2 m \sim 2\log_2 n$. Summing over all $S$ and doing a union bound,

$$\Pr(\alpha(G[S]) < k \text{ for some } S) \le 2^n e^{-n^{2-o(1)}} = o(1),$$

so with high probability, every $m$-element subset of $G$ contains a $k$-element independent set. Thus we can carry out our greedy coloring, and the total number of colors we use (including the last part where we use one color for each vertex) is

$$\frac{n - m}{k} + m = (1 + o(1))\frac{n}{2\log_2 n},$$

as desired.
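
The shape of the coloring strategy (peel off independent sets as color classes, then give each leftover vertex its own color) can be sketched in code. This is only an illustration under stated assumptions: the proof relies on near-maximum independent sets, whereas `greedy_independent_set` below is a simple scanning heuristic, so it should not be expected to reach the $\frac{n}{2\log_2 n}$ bound. All function names, the seed, and the parameters `n = 200`, `stop_at = 10` are hypothetical.

```python
import random

def random_graph(n, p, rng):
    """Adjacency sets of one sample of G(n, p)."""
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj

def greedy_independent_set(adj, vertices):
    """Heuristic stand-in: scan vertices in order, keeping those with no chosen neighbor."""
    chosen = []
    for v in sorted(vertices):
        if all(u not in adj[v] for u in chosen):
            chosen.append(v)
    return chosen

def color_by_peeling(adj, stop_at):
    """Peel off independent sets as color classes until at most stop_at vertices
    remain, then give every leftover vertex its own color."""
    remaining = set(adj)
    coloring, color = {}, 0
    while len(remaining) > stop_at:
        for v in greedy_independent_set(adj, remaining):
            coloring[v] = color
            remaining.discard(v)
        color += 1
    for v in sorted(remaining):
        coloring[v] = color
        color += 1
    return coloring, color

rng = random.Random(0)
n = 200
adj = random_graph(n, 0.5, rng)
coloring, num_colors = color_by_peeling(adj, stop_at=10)
print(num_colors)
```

Whatever independent-set routine is plugged in, the result is a proper coloring, because each color class is independent by construction.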


Note that this proof only works because we can get an exponential bound from the Janson inequalities! Bollobás' theorem guarantees some kind of concentration, but the window of deviation is still basically an open problem.

8 Martingale convergence and Azuma's inequality


8.1 Setup: what is a martingale? 


Definition 8.1 


A martingale is a sequence of random variables $Z_0, Z_1, \dots$ such that for every $n$, $\mathbb{E}|Z_n| < \infty$ (this is a technical assumption), and

$$\mathbb{E}[Z_{n+1} \mid Z_0, \dots, Z_n] = Z_n.$$


This comes up in a lot of different ways: 


Example 8.2 


Consider a random walk $X_1, X_2, \dots$ of independent steps, each with mean 0. Then we can define the martingale

$$Z_n = \sum_{i \le n} X_i,$$

which fits the definition because we always expect our average position after step $n + 1$ to be the same as where we just were after step $n$.


Example 8.3 (Betting strategy) 
Let’s say we go to a casino, and all bets are “fair” (have expectation 0). For example, we may bet on fair odds 
against a coin flip. Our strategy can adapt over time based on the outcomes: let Z, be our balance after n 


rounds. This is still a martingale! 


This is more general than just a random walk, because now we don't need the steps to be independent. 


Example 8.4 


Let's say my goal is to win 1 dollar. I adopt the following strategy:

• Bet a dollar; if I win, stop.

• Otherwise, double the wager and repeat.


This is a martingale, because all betting strategies are martingales. With probability 1, we must always win at some 
point, so we end up with 1 dollar at the end! This sounds like free money, but we have a finite amount of money (so 


this would never occur in real life). 


Remark. This is called the “martingale betting strategy,” and it's where the name comes from! 
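
To see why the doubling strategy is not free money, it helps to cap the number of rounds and compute the exact distribution of the final balance. The sketch below assumes fair coin flips and wagers of 1, 2, 4, ...; the function name and the cap of 10 rounds are illustrative.

```python
from fractions import Fraction

def doubling_balance_distribution(rounds):
    """Exact distribution of the balance after at most `rounds` fair bets,
    doubling the wager after each loss and stopping at the first win."""
    dist = {}
    lost_so_far = 0
    prob_all_lost = Fraction(1)
    wager = 1
    for _ in range(rounds):
        win_prob = prob_all_lost / 2        # lose every earlier bet, win this one
        balance = wager - lost_so_far       # 2^j - (2^j - 1) = 1
        dist[balance] = dist.get(balance, Fraction(0)) + win_prob
        prob_all_lost /= 2
        lost_so_far += wager
        wager *= 2
    # Remaining case: every bet was lost.
    dist[-lost_so_far] = dist.get(-lost_so_far, Fraction(0)) + prob_all_lost
    return dist

dist = doubling_balance_distribution(10)
expected = sum(balance * prob for balance, prob in dist.items())
print(dist, expected)
```

The rare total loss exactly offsets the frequent 1-dollar win, so the expected balance stays at 0 after any fixed number of rounds, as the martingale property requires.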


Definition 8.5 (Doob or exposure martingale) 
Suppose we have some (not necessarily independent) random variables $X_1, \dots, X_n$, and we have a function $f(X_1, \dots, X_n)$. Then let

$$Z_i = \mathbb{E}[f(X_1, \dots, X_n) \mid X_1, \dots, X_i].$$

Basically, we “expose” the first $i$ outputs to create $Z_i$. It's good to check that this is actually a martingale: show that

$$\mathbb{E}[Z_i \mid Z_0, \dots, Z_{i-1}] = Z_{i-1}.$$


Note that $f$ may also be some random variable: for example, it could be the chromatic number of the graph, and the $X_i$ are indicator variables of the edges. Then $Z_0$ is $\mathbb{E}[f]$, $Z_1$ is a revised mean after we learn about the status of an edge, and so on. This is called an edge-exposure martingale.

Let's discuss that more explicitly: we reveal the edges of $G(n, p)$ one at a time. For example, let's say we want $\chi\left(G\left(3, \frac12\right)\right)$. There are eight possible graphs, with equal probabilities, and six of them have chromatic number 2, one has chromatic number 3, and one has chromatic number 1. The average is $Z_0 = 2$.

Now, $Z_1$ is either 2.25 or 1.75, depending on whether the first exposed edge (the bottom edge) is present or not. The average of these is 2, and then we can keep going: $Z_2$ is either 2.5 or 2 if $Z_1 = 2.25$, and 2 or 1.5 if $Z_1 = 1.75$. The idea is that the expected value of each $Z_{i+1}$, given what has been revealed so far, is the previous value $Z_i$.

Alternatively, we can have a vertex-exposure martingale: at the $i$th step, expose all edges $(j, i)$ with $j < i$. So there are different ways of constructing this martingale, and which one to use depends on the application!
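
The edge-exposure values for $\chi\left(G\left(3, \frac12\right)\right)$ can be reproduced by brute force. The sketch below assumes an arbitrary labeling and exposure order of the three edges (`EDGES`) and uses the fact that a 3-vertex graph has chromatic number 1, 2, or 3 according to whether it has no edges, some edges, or all three.

```python
from fractions import Fraction
from itertools import product

EDGES = [(0, 1), (0, 2), (1, 2)]   # exposure order; the labeling is arbitrary

def chromatic_number(present):
    """Chromatic number of a 3-vertex graph given its three edge indicators."""
    m = sum(present)
    if m == 0:
        return 1
    return 3 if m == 3 else 2

def exposure_value(revealed):
    """Z_i: the average chromatic number over all completions of the revealed
    prefix of edge indicators, each hidden edge being a fair coin."""
    hidden = len(EDGES) - len(revealed)
    values = [chromatic_number(tuple(revealed) + rest)
              for rest in product((0, 1), repeat=hidden)]
    return Fraction(sum(values), len(values))

for prefix_len in range(len(EDGES) + 1):
    for revealed in product((0, 1), repeat=prefix_len):
        print(revealed, exposure_value(revealed))
```

Averaging the two children of any revealed prefix gives back the value at that prefix, which is the martingale property in this example.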


8.2 Azuma’s inequality 


Why are martingales useful? Here’s an important inequality that’s actually not too hard to prove: 


Theorem 8.6 (Azuma’s inequality) 


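Given a martingale $Z_0, \dots, Z_n$ with bounded differences

$$|Z_i - Z_{i-1}| \le 1 \quad \forall i \in [n],$$

we have a tail bound for all $\lambda > 0$:

$$\Pr(Z_n - Z_0 \ge \lambda\sqrt{n}) \le e^{-\lambda^2/2}.$$

More generally, though, if we have $|Z_i - Z_{i-1}| \le c_i$ for all $i \in [n]$, then for all $a > 0$,

$$\Pr(Z_n - Z_0 \ge a) \le \exp\left(\frac{-a^2}{2(c_1^2 + \cdots + c_n^2)}\right).$$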


(This is sometimes also known as Azuma-Hoeffding.) We've seen this before from our discussion of Chernoff 
bounds, which is a special case by making the martingale a sum of independent random variables! This is not a 
coincidence - we'll notice similarities in the proof. 

This theorem is useful when none of the martingale steps have big differences. It is generally more difficult to prove 


any sort of concentration when we don’t have bounded differences in our martingale, though. 


Proof. We can shift the martingale so that $Z_0 = 0$. Let $X_i = Z_i - Z_{i-1}$ be the martingale differences: these $X_i$s do
not need to be independent, but they must always have mean 0, even after conditioning on $Z_0, \dots, Z_{i-1}$.


Lemma 8.7 


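If $X$ is a random variable with $\mathbb{E}[X] = 0$ and $|X| \le c$, then

\[ \mathbb{E}[e^X] \le \frac{e^c + e^{-c}}{2} \le e^{c^2/2}, \]

where the second inequality follows by looking at the Taylor expansions and comparing coefficients.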
Proof of lemma. Basically, we maximize $\mathbb{E}[e^X]$ by having a $\pm c$ Bernoulli variable - this is true because of convexity!
Specifically, we can upper-bound $e^x$ on $[-c, c]$ by the line connecting the endpoints:

\[ e^x \le \frac{e^c + e^{-c}}{2} + \frac{e^c - e^{-c}}{2c}\, x, \]

and now take expectations of the statement when $x = X$.
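The "comparing coefficients" step in Lemma 8.7 can be written out term by term. This is a standard sketch of that comparison rather than something spelled out in the notes; it only uses the elementary bound $(2k)! \ge 2^k k!$.

```latex
\frac{e^{c} + e^{-c}}{2}
  = \sum_{k \ge 0} \frac{c^{2k}}{(2k)!}
  \le \sum_{k \ge 0} \frac{c^{2k}}{2^{k}\, k!}
  = \sum_{k \ge 0} \frac{(c^{2}/2)^{k}}{k!}
  = e^{c^{2}/2},
\qquad \text{since } (2k)! = k! \prod_{j=1}^{k} (k + j) \ge k!\, 2^{k}.
```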


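So now let $t > 0$, and consider the moment generating function $\mathbb{E}[e^{tZ_n}]$. We'll split this up as

\[ \mathbb{E}[e^{tZ_n}] = \mathbb{E}\left[e^{t(X_n + Z_{n-1})}\right] = \mathbb{E}\left[ e^{tZ_{n-1}}\, \mathbb{E}\left[e^{tX_n} \mid Z_0, \dots, Z_{n-1}\right] \right]. \]

By definition, the inner expectation is the moment generating function of a mean-zero random variable bounded by
$t c_n$, and thus by Lemma 8.7

\[ \mathbb{E}[e^{tZ_n}] \le e^{t^2 c_n^2/2}\, \mathbb{E}[e^{tZ_{n-1}}]. \]

Repeating this calculation or using induction, we find that the expectation of $e^{tZ_n}$ is bounded by

\[ \mathbb{E}[e^{tZ_n}] \le \exp\left( \frac{t^2}{2} \left(c_n^2 + c_{n-1}^2 + \cdots + c_1^2\right) \right). \]

To finish the proof, we repeat the logic of the Chernoff bound proof: by Markov's inequality on the moment generating
function,

\[ \Pr(Z_n \ge a) \le e^{-ta}\, \mathbb{E}[e^{tZ_n}] \le e^{-ta + t^2 (c_1^2 + \cdots + c_n^2)/2}. \]

We can now set $t$ to be whatever we want: taking $t = a / (c_1^2 + \cdots + c_n^2)$ yields the result.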


The main difference from Chernoff is that we do one step at a time, and this crucially requires that we have
bounded differences. We can also get a lower tail for $Z_n$ in the exact same way, and putting these together yields the
following:

Corollary 8.8
Let $Z_0, \dots, Z_n$ be a martingale where $|Z_i - Z_{i-1}| \le c_i$ for all $i \in [n]$, as in Theorem 8.6. Then for all $a > 0$,

\[ \Pr(|Z_n - Z_0| \ge a) \le 2 \exp\left( \frac{-a^2}{2(c_1^2 + \cdots + c_n^2)} \right). \]


Basically, a martingale with steps of size at most 1 can't walk much farther than on the order of $\sqrt{n}$ in either
direction, even when our choices can depend on previous events.
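To get a feel for how tight the bound is, one can compare it against the simplest martingale with $c_i = 1$: a sum of independent fair $\pm 1$ steps, whose tail is an exact binomial sum. The sketch below is illustrative only; the function names and the choice $n = 100$ are assumptions made here.

```python
from math import comb, exp, sqrt

def walk_upper_tail(n, a):
    """Exact Pr(S_n >= a), where S_n is a sum of n independent fair +-1 steps."""
    # S_n = 2H - n with H ~ Binomial(n, 1/2), so sum the matching binomial weights.
    favorable = sum(comb(n, h) for h in range(n + 1) if 2 * h - n >= a)
    return favorable / 2 ** n

def azuma_upper_bound(a, step_bounds):
    """Azuma's bound exp(-a^2 / (2 * sum c_i^2)) on Pr(Z_n - Z_0 >= a)."""
    return exp(-a * a / (2 * sum(c * c for c in step_bounds)))

n = 100
for lam in (0.5, 1.0, 2.0, 3.0):
    a = lam * sqrt(n)
    # The exact tail should never exceed the Azuma bound.
    print(lam, walk_upper_tail(n, a), azuma_upper_bound(a, [1] * n))
```

For this particular martingale the bound has the right Gaussian shape in $\lambda$, which matches the remark that Chernoff is the special case of independent steps.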


8.3 Basic applications of this inequality


The most common way Azuma is used is to show concentration for Lipschitz functions (on a domain of many variables). 
Theorem 8.9
Consider a function

\[ f : \Omega_1 \times \Omega_2 \times \cdots \times \Omega_n \to \mathbb{R} \]

such that $|f(x) - f(y)| \le 1$ whenever $x$ and $y$ are vectors that differ in exactly 1 coordinate. (This is known as
being 1-Lipschitz with respect to Hamming distance.) Then if $Z = f(X_1, \dots, X_n)$ is a function of independent
random variables $X_i \in \Omega_i$, we have high concentration:

\[ \Pr(Z - \mathbb{E}[Z] \ge \lambda \sqrt{n}) \le e^{-\lambda^2/2}. \]


Proof. Consider the Doob martingale

\[ Z_i = \mathbb{E}[Z \mid X_1, \dots, X_i]. \]

Note that $|Z_i - Z_{i-1}| \le 1$, because revealing 1 coordinate cannot change the value of our function by more than 1.
But now $Z_0$ is the expected value of our original function $Z$, since we have no additional information: thus, $Z_0 = \mathbb{E}[Z]$.
Meanwhile, $Z_n$ means we have all information about our function, so this is just $Z$. Plugging these into the Azuma
inequality yields the result.

It's important to note that the $|Z_i - Z_{i-1}| \le 1$ step only works if we have independence between our random
variables - that step is a bit subtle.


Example 8.10 (Coupon collecting) 


Let's say we want to collect an entire stack of coupons: we sample $s_1, \dots, s_n \in [n]$. Can we describe $X$, the
number of missed coupons?

Explicitly, we can write out

\[ X = \left| [n] \setminus \{s_1, \dots, s_n\} \right|. \]

It's not hard to calculate the expected value of $X$: by linearity of expectation, each coupon is missed with probability
$(1 - \frac{1}{n})^n$, so

\[ \mathbb{E}[X] = n \left(1 - \frac{1}{n}\right)^n. \]

This value is between $\frac{n-1}{e}$ and $\frac{n}{e}$. Typically, how close are we to this number? Changing one of the $s_i$s can only change
$X$ by at most 1 (we can only gain or lose up to one coupon). So by the concentration inequality,

\[ \Pr\left( \left|X - \frac{n}{e}\right| \ge \lambda \sqrt{n} + 1 \right) \le \Pr\left(|X - \mathbb{E}[X]| \ge \lambda \sqrt{n}\right) \le 2e^{-\lambda^2/2}, \]

where the $+1$ is for the approximation of $\frac{n}{e}$ we made. So the number of coupons we miss is pretty concentrated! This
would have been more difficult to solve without Azuma's inequality, because whether or not two different coupons are
collected are dependent variables.
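A quick simulation makes the example concrete. The sketch below is not from the notes: the parameters ($n = 400$, 200 trials, $\lambda = 3$) and the fixed seed are arbitrary choices, and it only checks that the empirical frequency of large deviations stays below the Azuma bound.

```python
import math
import random

def expected_missed(n):
    """E[X] = n (1 - 1/n)^n, the expected number of missed coupons."""
    return n * (1 - 1 / n) ** n

def simulate_missed(n, rng):
    """Draw n coupons uniformly from n types and count the types never drawn."""
    collected = {rng.randrange(n) for _ in range(n)}
    return n - len(collected)

def azuma_two_sided(lam):
    """Two-sided Azuma bound 2 exp(-lam^2 / 2) for deviations of lam * sqrt(n)."""
    return 2 * math.exp(-lam * lam / 2)

n, trials, lam = 400, 200, 3.0
rng = random.Random(0)  # fixed seed so that the run is reproducible
mean = expected_missed(n)
samples = [simulate_missed(n, rng) for _ in range(trials)]
far = sum(abs(x - mean) >= lam * math.sqrt(n) for x in samples)

print("E[X] =", mean, "lies between", (n - 1) / math.e, "and", n / math.e)
print("empirical tail:", far / trials, "Azuma bound:", azuma_two_sided(lam))
```

In practice the deviations are far smaller than $\lambda\sqrt{n}$ here, so the bound is valid but not tight in its constants.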


Theorem 8.11 
Let $\varepsilon_1, \varepsilon_2, \dots, \varepsilon_n \in \{-1, 1\}$ be uniformly and independently chosen. Fix vectors $v_1, \dots, v_n$ in some normed space
(Euclidean if we'd like) such that all vectors satisfy $\|v_i\| \le 1$. Then $X = \|\varepsilon_1 v_1 + \cdots + \varepsilon_n v_n\|$ is pretty concentrated around
its mean:

\[ \Pr(|X - \mathbb{E}[X]| \ge \lambda \sqrt{n}) \le 2e^{-\lambda^2/2}. \]

Even if we can't compute the mean of this variable, we can still get concentration! Note that if the $v_i$s all point
along the same axis, then we essentially end up with the Chernoff bound.


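Proof. Our $\Omega_i$s are $\{-1, 1\}$, and we have a function defined as

\[ f(\varepsilon_1, \dots, \varepsilon_n) = \|\varepsilon_1 v_1 + \cdots + \varepsilon_n v_n\|. \]

If we change one coordinate $\varepsilon_i$, the norm can change by at most 2 by the triangle inequality, because each $v_i$ has norm
at most 1. Plugging this into Azuma (with $c_i = 2$), we find that

\[ \Pr(|X - \mathbb{E}[X]| \ge \lambda \sqrt{n}) \le 2e^{-\lambda^2/8}. \]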


This is usually good enough (the exponent is of the right order), but with a little more care, we can change the
constant in our exponent from $\frac{1}{8}$ to $\frac{1}{2}$. Let's go back to the exposure martingale, and let $Y_i$ be the expected value of
our function $f$ after having $\varepsilon_1, \dots, \varepsilon_i$ revealed: we claim that $|Y_i - Y_{i-1}| \le 1$.

Why is this the case? If we've revealed the first $i - 1$ coordinates, let $\varepsilon$ and $\varepsilon'$ be two vectors in $\{-1, 1\}^n$ differing
only in the $i$th coordinate. Then

\[ Y_{i-1}(\varepsilon) = \frac{1}{2}\left( Y_i(\varepsilon) + Y_i(\varepsilon') \right): \]

we should average over what happens in the $i$th coordinate if we only know what happens in the first $i - 1$ coordinates. So
now plugging in,

\[ |Y_i(\varepsilon) - Y_{i-1}(\varepsilon)| = \frac{1}{2} |Y_i(\varepsilon) - Y_i(\varepsilon')| \le \frac{1}{2} \cdot 2\|v_i\| \le 1, \]

and now Azuma gives us the desired constant!


8.4 Concentration of the chromatic number 


Last time, we derived (using Janson’s inequality) an estimate for the chromatic number of a random graph, and this 
took some work. But it turns out that we can prove concentration of the chromatic number without knowing the 


mean: 


Theorem 8.12 
Let $G = G(n, p)$ be a random graph. Then

\[ \Pr\left( |\chi(G) - \mathbb{E}[\chi(G)]| \ge \lambda \sqrt{n - 1} \right) \le 2e^{-\lambda^2/2}. \]

To prove this, let's think about the process of finding the chromatic number as a martingale. This does not even
require us knowing that $\chi(G)$ is about $\frac{n}{2 \log_2 n}$ (for $p = \frac{1}{2}$):
proving concentration is somehow easier than finding the mean here!


Proof. There are many ways to expose the edges of a graph: sometimes we need to choose between edge and vertex
exposure. Here, we'll do the latter.

Consider the vertex-exposure martingale. Basically, at step $i$ we're given the status of all edges among the
first $i$ vertices, and then we try to figure out the estimate from there. Note that $|Z_i - Z_{i-1}| \le 1$: we could always just
give the $i$th vertex a new color to preserve proper coloring, so the expected chromatic numbers can't differ by more
than 1. But now we're done by applying Azuma's inequality.

Let's try seeing what happens if we used the edge-exposure martingale instead. We have more steps: there are
$\binom{n}{2}$ edges to reveal, so we should get about $O(n)$-size deviation. That's already not so good, since we're trying to
find the chromatic number (which is at most $n$). We can't even bound $|Z_i - Z_{i-1}|$ any better than before!
Remark. It's important to note that neither the vertex-exposure nor the edge-exposure martingale is always the better choice.
Instead, our method really depends on what the maximum differences are between subsequent steps.
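The comparison between the two exposure orders comes down to the number of steps when every step is 1-Lipschitz. The sketch below inverts the two-sided Azuma bound to get the deviation it guarantees at a given failure probability; the helper name, $n = 1000$ and $\delta = 0.01$ are assumptions chosen only for illustration.

```python
from math import comb, log, sqrt

def azuma_deviation(num_steps, delta, c=1.0):
    """Smallest a with 2 * exp(-a^2 / (2 * num_steps * c^2)) <= delta."""
    return c * sqrt(2 * num_steps * log(2 / delta))

n, delta = 1000, 0.01
vertex = azuma_deviation(n - 1, delta)     # vertex exposure: n - 1 steps
edge = azuma_deviation(comb(n, 2), delta)  # edge exposure: C(n, 2) steps

print("vertex-exposure window:", vertex)
print("edge-exposure window:  ", edge)
print("ratio:", edge / vertex, "vs sqrt(n / 2) =", sqrt(n / 2))
```

With these numbers the edge-exposure window is wider than $n$ itself, so it says nothing about a quantity that is at most $n$, while the vertex-exposure window is on the order of $\sqrt{n}$.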


As a final note, can we also set up a Lipschitz function to rephrase the setting of this problem? Our random
variable spaces in the vertex-exposure martingale, $\Omega_i$, aren't edges but batches of edges: at step $i$, we want $\{0, 1\}^{i-1}$,
the edges going "left" from vertex $i$. So the vertex-exposure partitions our edge set and exposes groups at a time: if we
do the batching appropriately, we get a better gain than naively having our probability spaces just be $\{0, 1\}$ for each
edge. So the $\Omega_i$s are fairly general, but they must still be independent.

The idea of a tail bound is the same as a confidence interval: Azuma tells us that we can take an interval with
width on the order of $\sqrt{n}$, and at least some constant fraction of our random graphs will give chromatic number in
that interval. This might be overly generous, though: it's a major open problem to know the actual fluctuation of the
chromatic number of a random graph!


8.5 Four-point concentration? 


Interestingly, we can get even better concentration if we have sufficiently small p: 


Theorem 8.13 


Let $\alpha > \frac{5}{6}$ be a fixed constant. If $p < n^{-\alpha}$, then the chromatic number $\chi(G(n, p))$ is concentrated among four
values with high probability. Specifically, there exists a function $u = u(n, p)$ such that as $n \to \infty$,

\[ \Pr(u \le \chi(G(n, p)) \le u + 3) = 1 - o(1). \]


In other words, sparser graphs are easier to estimate. Because the probability of an edge appearing here is relatively 


small, we can get more concentration than with our earlier calculations. 


Proof. It suffices to show that for any fixed $\varepsilon > 0$, we can find a sequence $u = u(n, p, \varepsilon)$ such that as $n \to \infty$,

\[ \Pr(u \le \chi(G(n, p)) \le u + 3) \ge 1 - \varepsilon - o(1). \]

Pick $u$ to be the smallest positive integer such that $\Pr(\chi(G(n, p)) \le u) > \varepsilon$. (This is deterministic, even if we may
not know how to evaluate it.) Then the probability that $\chi(G) < u$ is at most $\varepsilon$, and we just want to show that
$\chi(G) > u + 3$ with probability $o(1)$.

The next step is very clever: let $Y = Y(G)$ be the minimum size of a subset of the vertices $S \subseteq V(G)$ such that
$G - S$ may be properly colored with $u$ colors. Basically, we color as well as we can, and $Y$ tells us how close we are to
success.

Now $Y$ is 1-Lipschitz with respect to the vertex-exposure martingale: if we change the edges at a single vertex of $G$, then $Y(G)$
changes by at most 1. So by Azuma's inequality,

\[ \Pr(Y \le \mathbb{E}[Y] - \lambda \sqrt{n}) \le e^{-\lambda^2/2}, \]

\[ \Pr(Y \ge \mathbb{E}[Y] + \lambda \sqrt{n}) \le e^{-\lambda^2/2}. \]

This trick will come up a lot: we'll use both the upper and lower tail separately. We don't need to know the
expectation to find concentration, but we'll use the lower-tail bound to bound $\mathbb{E}[Y]$. With probability at least $\varepsilon$, $G$ is
$u$-colorable. That's equivalent to saying that $Y = 0$, which occurs with probability

\[ \varepsilon < \Pr(Y = 0) = \Pr(Y \le \mathbb{E}[Y] - \mathbb{E}[Y]) \le \exp\left( -\frac{\mathbb{E}[Y]^2}{2n} \right). \]
Simplifying this, this already gives us a bound (for some $\lambda$, which is a function of $\varepsilon$)

\[ \mathbb{E}[Y] \le \sqrt{2 \log\left(\frac{1}{\varepsilon}\right) n} = \lambda \sqrt{n}, \]

which is what we should expect from a martingale of this form. Similarly, we can do an upper-tail bound to show that
$Y$ is rarely too big relative to the mean:

\[ \Pr(Y \ge 2\lambda \sqrt{n}) \le \Pr(Y \ge \mathbb{E}[Y] + \lambda \sqrt{n}) \le e^{-\lambda^2/2} = \varepsilon \]

by the definition of $\lambda$. So the number of uncolored vertices is not too big: by the definition of $Y$, we now know that
with probability at least $1 - \varepsilon$, we can color all but $2\lambda\sqrt{n}$ vertices. Here's the key step: we'll show that with high
probability, we can color the remaining vertices with just 3 colors.


Lemma 8.14 


Fix $\alpha > \frac{5}{6}$ as before, as well as a constant $C$, and let $p < n^{-\alpha}$. Then with high probability, every subset of size at
most $C\sqrt{n}$ vertices in $G(n, p)$ can be properly 3-colored.


We want to union bound the bad probabilities, but we must be a bit careful here. Suppose the lemma were false
for some graph $G$ (that is, we're in one of the bad cases). Choose a minimum-size $T \subseteq V(G)$ that is not 3-colorable.
Consider the induced subgraph $G[T]$ (taking only the edges between the vertices in $T$). This has minimum degree at least 3,
because if there's a vertex $x$ with $\deg_T(x) < 3$, then $T - x$ is also not 3-colorable (a proper 3-coloring of $T - x$ would
extend to $x$), which contradicts the minimality of $T$.

So $G[T]$ has at least $3|T|/2$ edges, and now we can just bound the probability that there exists some $T$ (of size
at least 4, since that's the only way for it to be not 3-colorable) with $|T| \le C\sqrt{n}$ that contains at least $3|T|/2$ edges.
Union bounding, this is at most

\[ \sum_{t=4}^{C\sqrt{n}} \binom{n}{t} \binom{\binom{t}{2}}{3t/2} p^{3t/2}. \]

Now we just need to show that this quantity is $o(1)$ (as $n$ goes to $\infty$): it is at most

\[ \sum_{t=4}^{C\sqrt{n}} \left(\frac{ne}{t}\right)^{t} \left(\frac{te}{3}\right)^{3t/2} n^{-3t\alpha/2} = \sum_{t=4}^{C\sqrt{n}} \left( O\left(n^{1 - 3\alpha/2}\, t^{1/2}\right) \right)^{t} \le \sum_{t=4}^{C\sqrt{n}} \left( O\left(n^{5/4 - 3\alpha/2}\right) \right)^{t} = o(1) \]

if $\alpha > \frac{5}{6}$.
So in summary, with probability at least $1 - \varepsilon$ all but $C\sqrt{n}$ of the vertices can be colored with $u$ colors, and with
probability $1 - o(1)$ the remaining vertices can be colored with at most 3 additional colors. Now just take $\varepsilon$ arbitrarily
small to show the result.

The hardest part of this proof is finding an informative random variable that is Lipschitz! It turns out that better
bounds are known: we actually have two-point concentration for all $\alpha > \frac{1}{2}$, and the proof comes from refinements of
this technique.

8.6 Revisiting an earlier chromatic number lemma


Remember that when we discussed Janson's inequality, we considered the following key claim from Bollobás' paper,
helpful for taking out large independent sets:

Lemma 8.15
Let $\omega(G)$ be the number of vertices in the largest clique of $G = G(n, \frac{1}{2})$, and let $k_0$ be the largest positive integer
such that $\binom{n}{k_0} 2^{-\binom{k_0}{2}} \ge 1$. Then if $k = k_0 - 3$ (here $k \sim 2 \log_2 n$),

\[ \Pr(\omega(G) < k) \le e^{-n^{2 - o(1)}}. \]

(This is also Lemma 7.20.) Here's an alternative proof.


Proof. Let $Y$ be the maximum size of a collection of pairwise edge-disjoint $k$-cliques in $G$. $Y$ is not 1-Lipschitz with respect to
vertex-exposure: for example, my graph could have a bunch of cliques that all share only one vertex. However, it is
1-Lipschitz with respect to edge-exposure (since each edge can only be part of one $k$-clique in such a collection anyway).

So now the probability that $\omega(G) < k$ is the probability that $Y = 0$ (there are no $k$-cliques). Using the lower tail of Azuma's
inequality with $\binom{n}{2}$ edge-exposure steps,

\[ \Pr(Y = 0) = \Pr(Y \le \mathbb{E}[Y] - \mathbb{E}[Y]) \le \exp\left( -\frac{\mathbb{E}[Y]^2}{2\binom{n}{2}} \right). \]

It now remains to show that $\mathbb{E}[Y]$ is large: if we can show that the expected value is $n^{2 - o(1)}$, then lower tail estimates
tell us that it is very rare for $Y$ to be 0.

So we have this graph $G(n, \frac{1}{2})$, and we're asking how many $k$-cliques we can pack into it. Remember the problem
set: the trick is to create an auxiliary graph $H$ whose vertices are the $k$-cliques of $G$. Then two cliques $S, T$ are adjacent
in $H$ if they overlap in at least two vertices, so they have a common edge.

Now $Y = \alpha(H)$ is the size of the largest independent set in $H$: by Caro-Wei, it suffices to show that $H$ has lots of
vertices and not that many edges, since we get the convexity bound

\[ \alpha(H) \ge \sum_{v \in V(H)} \frac{1}{1 + d_v} \ge \frac{|V(H)|}{1 + \bar{d}} = \frac{|V(H)|^2}{|V(H)| + 2|E(H)|}, \]

where $\bar{d}$ is the average degree of $H$. By the second moment method, $|V(H)|$, the number of $k$-cliques in $G$, is concentrated
with high probability around its mean, which is $\binom{n}{k} 2^{-\binom{k}{2}}$ by linearity of expectation. By definition of $k_0$, this is at
least $n^{3 - o(1)}$, and on the other hand, the number of edges in $H$ is concentrated around its mean, which is $n^{4 + o(1)}$.
So by Caro-Wei, the expected value of $Y$ is

\[ \mathbb{E}[Y] = \mathbb{E}[\alpha(H)] \ge \mathbb{E}\left[ \frac{|V(H)|^2}{|V(H)| + 2|E(H)|} \right] \ge n^{2 - o(1)}, \]

as desired. But here's another way to reach that conclusion, which we've seen a few times: the sampling technique!
Choose a $q$-random subset of $k$-cliques in $G$ (each clique with probability $q$). We'll just get rid of one clique from
each overlapping pair to get a large independent set in $H$. We expect to get $\mathbb{E}[q|V(H)|]$ cliques, but $\mathbb{E}[q^2|E(H)|]$ overlapping pairs
(since $H$ is random as well). Now pick $q$ to maximize the difference: $q = \frac{\mathbb{E}|V(H)|}{2\,\mathbb{E}|E(H)|}$, and this means that the expected size of our
independent set is at least $\frac{(\mathbb{E}|V(H)|)^2}{4\,\mathbb{E}|E(H)|}$, and we're done.
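The two lower bounds on $\alpha(H)$ used in this proof can be sanity-checked on a tiny graph by brute force. The sketch below is illustrative only: the 5-vertex "house" graph and the function names are assumptions made here, and the sampling bound $|V|^2 / (4|E|)$ is the deterministic analogue of the expectation computed above (valid when $q = |V| / (2|E|) \le 1$).

```python
from fractions import Fraction
from itertools import combinations

def independence_number(num_vertices, edges):
    """Largest independent set size, by brute force over vertex subsets."""
    edge_set = {frozenset(e) for e in edges}
    for size in range(num_vertices, 0, -1):
        for subset in combinations(range(num_vertices), size):
            if all(frozenset(pair) not in edge_set for pair in combinations(subset, 2)):
                return size
    return 0

def lower_bounds(num_vertices, edges):
    """Caro-Wei sum, its convexity relaxation, and the sampling bound."""
    degree = [0] * num_vertices
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    caro_wei = sum(Fraction(1, 1 + d) for d in degree)
    convexity = Fraction(num_vertices ** 2, num_vertices + 2 * len(edges))
    sampling = Fraction(num_vertices ** 2, 4 * len(edges))  # needs |V| <= 2|E|
    return caro_wei, convexity, sampling

# A square 0-1-2-3 with a "roof" vertex 4 attached to 0 and 1.
HOUSE = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 4), (1, 4)]
alpha = independence_number(5, HOUSE)
caro_wei, convexity, sampling = lower_bounds(5, HOUSE)
print(alpha, caro_wei, convexity, sampling)
```

The chain $\alpha \ge$ Caro-Wei $\ge$ convexity bound $\ge$ sampling bound shows why either estimate suffices once $|E(H)|$ is much larger than $|V(H)|$: both are of order $|V|^2 / |E|$.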
9 Concentration of measure


9.1 The geometric picture 


Concentration of measure is an important concept in high-dimensional probability and geometry. We've shown examples 
of concentration of Lipschitz functions of many variables, and it turns out this concept is integrally connected to 


isoperimetry. For example, what’s the largest area we can fence off with a given perimeter? This can be rephrased: 


Problem 9.1 


Given some constant volume V, what’s the minimum possible surface area of that volume? 


An example of a space we can work in is the Hamming cube: if we have an n-dimensional cube, and we label some 
number of points, what's the minimum size of the boundary? We can consider the Hamming distance, the number 
of differing coordinates between two vertices of the cube. It seems that we want to take some portion of the cube 


which is within some Hamming distance of a fixed point (this is a “ball”), which turns out to be true: 


Theorem 9.2 (Harper) 
If $B$ is a (Hamming) ball, and the volume of $A$ is equal to the volume of $B$, then the volume of $A_t$ is at least the
volume of $B_t$, where $A_t$ is the set of all points within a distance $t$ of some point of $A$.

What does this have to do with concentration of measure? We can prove an approximate version of Harper's
theorem. For $n$ very large, the distribution of Hamming distances looks like a normal distribution with width $\sqrt{n}$, so
starting with a Hamming ball with $\varepsilon$ area can be thought of as the set of points below some threshold on the normal distribution.


Theorem 9.3 
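For every $\varepsilon > 0$, there exists a $t > 0$ such that for any subset $A \subseteq \{0, 1\}^n$ of the Hamming cube with $|A| \ge \varepsilon 2^n$,

\[ |A_{t\sqrt{n}}| \ge (1 - \varepsilon) 2^n. \]

Proof. We're looking at a hypercube: pick a random vertex $x$ in $\{0, 1\}^n$ uniformly, and let $X$ be the distance between
$x$ and the closest point in $A$. By the triangle inequality, this is 1-Lipschitz, and this is informative because $X = 0$ is the
same as saying $x \in A$, which happens with probability at least $\varepsilon$. By the Azuma lower tail inequality, the probability
that $X = 0$ satisfies

\[ \varepsilon \le \Pr(X = 0) = \Pr(X \le \mathbb{E}[X] - \mathbb{E}[X]) \le \exp\left( -\frac{\mathbb{E}[X]^2}{2n} \right). \]

This gives an upper bound on the expectation of $X$: $\mathbb{E}[X] \le \sqrt{2 \log\left(\frac{1}{\varepsilon}\right) n}$, and now we use the upper tail estimate.
That tells us that $X$ shouldn't deviate too much: the probability that $x \notin A_{t\sqrt{n}}$, where $t = 2\sqrt{2 \log\left(\frac{1}{\varepsilon}\right)}$, is

\[ \Pr\left( X > 2\sqrt{2 \log\left(\tfrac{1}{\varepsilon}\right) n} \right) \le \Pr\left( X \ge \mathbb{E}[X] + \sqrt{2 \log\left(\tfrac{1}{\varepsilon}\right) n} \right) \le \exp\left( -\frac{2 \log\left(\frac{1}{\varepsilon}\right) n}{2n} \right) = \varepsilon. \]

(Rephrased, our variable is not very large in expectation, and it is rarely much larger.) So $x$ is in $A_{t\sqrt{n}}$ with probability at
least $1 - \varepsilon$, as desired.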
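For Hamming balls, which Harper's theorem identifies as the worst case, the statement of Theorem 9.3 can be checked directly with binomial sums, because the $s$-neighborhood of a ball of radius $r$ is the ball of radius $r + s$. The sketch below is an illustration under assumed parameters ($n = 400$, $\varepsilon = 0.05$); the function names are made up for this example.

```python
from math import ceil, comb, log, sqrt

def ball_fraction(n, r):
    """Fraction of {0,1}^n within Hamming distance r of the all-zeros vertex."""
    if r < 0:
        return 0.0
    return sum(comb(n, k) for k in range(min(r, n) + 1)) / 2 ** n

def smallest_radius_with_mass(n, eps):
    """Smallest radius r whose Hamming ball has measure at least eps."""
    r = 0
    while ball_fraction(n, r) < eps:
        r += 1
    return r

n, eps = 400, 0.05
r = smallest_radius_with_mass(n, eps)
t = 2 * sqrt(2 * log(1 / eps))  # the constant t from the proof above
expanded = ball_fraction(n, r + ceil(t * sqrt(n)))

print("radius:", r, "measure:", ball_fraction(n, r))
print("after expanding by t*sqrt(n):", expanded, ">=", 1 - eps)
```

The expansion used in the proof is far from tight for this example, which is consistent with the proof only aiming for the right order $\sqrt{n}$.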


This is actually a fairly general result, and we can go back and forth between the geometric and combinatorial 


interpretations of this statement. 
Proposition 9.4
Let $t, \varepsilon > 0$ be real numbers, and let $\Omega$ be a probability space on which there exists a metric (such as the Hamming
cube with the Hamming metric). Then the following are equivalent:

1. (Approximate isoperimetry) For all subsets $A \subseteq \Omega$ with $\Pr(A) \ge \frac{1}{2}$, given the set

\[ A_t = \{\omega : \operatorname{dist}(\omega, A) \le t\}, \]

we have $\Pr(A_t) \ge 1 - \varepsilon$.
2. (Concentration of Lipschitz functions) For all functions $f : \Omega \to \mathbb{R}$ that are 1-Lipschitz, meaning $|f(x) - f(y)| \le \operatorname{dist}(x, y)$,
if we have a median $m \in \mathbb{R}$ such that $\Pr(f \le m) \ge \frac{1}{2}$ and $\Pr(f \ge m) \ge \frac{1}{2}$, then

\[ \Pr(f > m + t) \le \varepsilon. \]

Note that this is concentration around the median, not the mean. We'll soon see that these aren't that different,
though.


Proof. First let's show that (1) implies (2). Take the half of the probability space

\[ A = \{\omega \in \Omega : f(\omega) \le m\}. \]

This is at least half of our probability space by the definition of the median, and since $f$ is 1-Lipschitz,

\[ f(\omega) \le m + t \quad \forall \omega \in A_t. \]

Thus,

\[ \Pr(f > m + t) \le 1 - \Pr(A_t) \le \varepsilon \]

by condition (1).

The reverse implication (2) to (1) is not that hard either. We want to show that given any set $A$ with half the
space, its $t$-neighborhood consumes almost the whole space. The natural choice for our Lipschitz function $f$ is the
distance

\[ f(\omega) = \operatorname{dist}(\omega, A). \]

We pick $m = 0$ (this is a median, because $f \ge 0$ always and $\Pr(f \le 0) \ge \Pr(A) \ge \frac{1}{2}$), and now

\[ 1 - \Pr(A_t) = \Pr(\operatorname{dist}(\omega, A) > t) \le \varepsilon \]

by condition (2). Now take the complement, $\Pr(A_t) = \Pr(f \le t) \ge 1 - \varepsilon$, and we've shown condition (1).


This can be useful, because sometimes it’s more natural to think in terms of isoperimetry instead of functions (or 


vice versa). 


9.2 Results about concentration: median versus mean 


Let’s look at another form of concentration of Lipschitz functions: 
Proposition 9.5
If we have a 1-Lipschitz function $f : \{0, 1\}^n \to \mathbb{R}$ (with respect to the Hamming metric), pick $\omega \sim \operatorname{Unif}(\{0, 1\}^n)$,
and let $X = f(\omega)$ be our random variable. Then for all $s \in \mathbb{R}$, $t \ge 0$,

\[ \Pr(X \le s) \Pr(X \ge s + t) \le e^{-t^2/(4n)}. \]

We should think about this as taking "either $s$ or $s + t$ to be a median": then one of the terms is at least the constant $\frac{1}{2}$,
and moving that to the other side gives the Gaussian-like tail for the other term.


Proof. We'll apply Azuma's inequality twice. Let $\mu$ be the mean of $X - s$ (we can always just shift the variable $X$ so
that $s = 0$). If $\mu \le 0$, then by Azuma upper tail,

\[ \Pr(X \le s) \Pr(X \ge s + t) \le \Pr(X \ge s + t) = \Pr(X - s - \mu \ge t - \mu) \le \exp\left( -\frac{(t - \mu)^2}{2n} \right) \le e^{-t^2/(2n)}, \]

and we're done. Similarly, if $t - \mu \le 0$,

\[ \Pr(X \le s) \Pr(X \ge s + t) \le \Pr(X \le s) = \Pr(X - s - \mu \le -\mu) \le \exp\left( -\frac{\mu^2}{2n} \right) \le e^{-t^2/(2n)}, \]

and we're again done. So we're just left with the case where $\mu > 0$ and $t - \mu > 0$.

In this case, by Azuma's inequality (lower tail), we can say that

\[ \Pr(X \le s) = \Pr(X - s - \mu \le -\mu), \]

and since $X - s - \mu$ is mean-zero and Lipschitz, this is at most $e^{-\mu^2/(2n)}$. On the other hand, by Azuma (upper tail),

\[ \Pr(X \ge s + t) = \Pr(X - s - \mu \ge t - \mu) \le e^{-(t - \mu)^2/(2n)}. \]

(Be careful here: we can really use this for $t - \mu > 0$, but otherwise we can just repeat the argument the other way
around by starting with an upper tail argument instead.) Putting these together,

\[ \Pr(X \le s) \Pr(X \ge s + t) \le \exp\left( -\frac{\mu^2 + (t - \mu)^2}{2n} \right) \le \exp\left( -\frac{t^2}{4n} \right) \]

by convexity (since $\mu^2 + (t - \mu)^2 \ge t^2/2$), which is what we want.


The next few sections are essentially about how to interpret these ideas. One way is to think about $A \subseteq \{0, 1\}^n$
as a subset of the Boolean cube. If $\Pr(A) = |A| / 2^n$ (uniform measure), then we have the following:

Corollary 9.6
Consider a uniform measure on the Boolean cube, and let $t \ge 0$. Then

\[ \Pr(A) \left(1 - \Pr(A_t)\right) \le e^{-t^2/(4n)}. \]

In particular, if $A$ is at least half the cube, and we expand it by some $c\sqrt{n}$, we get almost the entire cube:

\[ 1 - \Pr(A_{c\sqrt{n}}) \le 2e^{-c^2/4}. \]


Earlier in the class, we were using the mean for concentration and other concepts, but now we have the median
instead: what relations are there between the mean and median? Suppose that we have a bound of the form

\[ 1 - \Pr(A_t) \le Ce^{-(t/\sigma)^2} \]

for all $A$ with $\Pr(A) \ge \frac{1}{2}$. (In this case, $\sigma$ is on the order of $\sqrt{n}$.) Given any 1-Lipschitz function $f$, and letting $X = f(\omega)$ (where $\omega$ is
random in $\Omega$), if we have a median $m$ of $X$, then the difference between the mean and median is

\[ |\mathbb{E}[X] - m| \le \mathbb{E}|X - m| = \int_0^\infty \Pr(|X - m| \ge t)\, dt. \]

If we have sub-Gaussian tail bounds, this is

\[ \le \int_0^\infty 2Ce^{-(t/\sigma)^2}\, dt = C'\sigma \]

for a constant $C'$, so the mean and median don't differ by more than a constant times $\sigma$. This is actually the tightest bound we can
produce (up to the constant): consider the function $f : \{-1, 1\}^n \to \mathbb{R}$ defined by

\[ f(x_1, \dots, x_n) = |x_1 + \cdots + x_n|. \]

We can evaluate its mean and median by the Central Limit Theorem: $\frac{\mathbb{E}[X]}{\sqrt{n}}$ and $\frac{\operatorname{med} X}{\sqrt{n}}$ converge to $\mathbb{E}|Z|$ and $\operatorname{med}|Z|$,
where $Z$ is the standard normal - those are different constants! So the idea is that in general, we have

\[ \Pr(|X - \mathbb{E}[X]| \ge t) \le C' e^{-c(t/\sigma)^2} \]

for some constants $C', c > 0$, and we have concentration around the mean as well - just possibly with a worse constant than the median version.
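The two limiting constants in this example are easy to compute, and the finite-$n$ values can be obtained exactly from the binomial distribution. The sketch below is illustrative: $n = 400$ and the function names are assumptions, and it uses the standard facts $\mathbb{E}|Z| = \sqrt{2/\pi}$ and $\operatorname{med}|Z| = \Phi^{-1}(3/4)$.

```python
from math import comb, pi, sqrt
from statistics import NormalDist

def abs_walk_distribution(n):
    """Exact distribution of |x_1 + ... + x_n| for uniform x in {-1, 1}^n."""
    counts = {}
    for h in range(n + 1):  # h = number of +1 coordinates
        value = abs(2 * h - n)
        counts[value] = counts.get(value, 0) + comb(n, h)
    return {value: count / 2 ** n for value, count in counts.items()}

def mean_and_median(dist):
    """Mean, and the smallest value v with Pr(X <= v) >= 1/2 (a median)."""
    mean = sum(value * prob for value, prob in dist.items())
    cumulative = 0.0
    for value in sorted(dist):
        cumulative += dist[value]
        if cumulative >= 0.5:
            return mean, value

n = 400
mean, median = mean_and_median(abs_walk_distribution(n))
limit_mean = sqrt(2 / pi)                  # E|Z| for a standard normal Z
limit_median = NormalDist().inv_cdf(0.75)  # med|Z| solves Pr(|Z| <= m) = 1/2

print("mean / sqrt(n):  ", mean / sqrt(n), "limit:", limit_mean)
print("median / sqrt(n):", median / sqrt(n), "limit:", limit_median)
```

The gap between the two normalized values stays bounded away from zero, so the mean and median really can differ by a constant multiple of $\sqrt{n}$.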


9.3 High-dimensional spheres

Most of our intuition about high-dimensional geometry is wrong! A good reference is Keith Ball's "An elementary
introduction to modern convex geometry."

Theorem 9.7 (Isoperimetric inequality in $\mathbb{R}^n$)

If $A, B \subseteq \mathbb{R}^n$, $B$ is a ball, and $\lambda(A) = \lambda(B)$ (they have the same measure), then

\[ \lambda(A_t) \ge \lambda(B_t) \]

for all $t \ge 0$.


Basically, this is asking for the "smallest perimeter" among all sets of the same volume. We saw a version of this
earlier by Harper: if $A, B \subseteq \{0, 1\}^n$ are subsets of the Hamming cube, $|A| = |B|$ and $B$ is a Hamming ball, then
$|A_t| \ge |B_t|$ for all $t \ge 0$.

Here's the most conceptual way to think about it: this is called Steiner symmetrization (alternatively, shifting or
compression in the discrete case). The idea is to transform $A$ to preserve the volume and decrease its perimeter. Cut
it in half so we have half the volume on each side: if one side has smaller perimeter, then lose the worse side and
reflect the better side over the cut. If we can't do this, then every such cut must cut the perimeter in half as well: then just
show that such a shape must be a ball.

This isn't actually a proof though: we might need to do this infinitely many times, so there are some compactness
issues with this idea. For the discrete setting, we just keep compressing our shape in some direction.


It turns out there’s also a spherical isoperimetric inequality: 
Theorem 9.8 (Lévy)

On the unit sphere $S^{n-1} \subseteq \mathbb{R}^n$, use the arc distance (though it doesn't really matter). Then given two subsets
$A, B \subseteq S^{n-1}$, where $B$ is some spherical cap and $\lambda(A) = \lambda(B)$, we have $\lambda(A_t) \ge \lambda(B_t)$ for all $t \ge 0$.

This isn't easy to show, but remember that approximate isoperimetry is connected to concentration of measure.
The counterintuitive thing is that distribution of measure is very different in high dimensions than in our ordinary 3-D
space.


Fact 9.9 


The volume of a $t$-neighborhood of a hemisphere $C$ is almost everything:

\[ \Pr(C_t) \ge 1 - e^{-cnt^2} \]

for an absolute constant $c > 0$, where $n$ is the number of dimensions.

We should not think of an $n$-dimensional shape as a very ball-like object: distribution of mass looks normal along
any axis, with standard deviation $\frac{1}{\sqrt{n}}$. Also, most of the mass is near the surface rather than the middle!
Since there's a spherical isoperimetric inequality, we should also have an analogous statement about Lipschitz 


functions for the sphere: 


Proposition 9.10 
There exist absolute constants $c, C$ (not dependent on $n$) such that given a function $f : S^{n-1} \to \mathbb{R}$ that is
1-Lipschitz,

\[ \Pr(|f - \mathbb{E}[f]| > t) \le Ce^{-cnt^2}. \]

This can be rephrased as "every Lipschitz function is nearly constant nearly everywhere."
For cultural value, here's one more space that's good to mention: the Gauss space in $\mathbb{R}^n$ has the Euclidean metric,
and we have the probability distribution

\[ Z = (Z_1, \dots, Z_n), \quad Z_i \sim N(0, 1) \]

(with all $Z_i$s independently distributed). We again have an isoperimetry theorem:


Theorem 9.11 
If $A, B \subseteq \mathbb{R}^n$, and $B$ is a "ball" with $\Pr(A) = \Pr(B)$, then

\[ \Pr(A_t) \ge \Pr(B_t) \quad \forall t \ge 0. \]

A "ball" in Gauss space is intuitively supposed to look like a sphere of radius $\sqrt{n}$, because the probability density
function is

\[ f_n(x) = (2\pi)^{-n/2} e^{-|x|^2/2}. \]

The nice thing here is that Gaussian vectors are rotationally symmetric, and now the length of this vector can be
written more simply:

\[ |Z|^2 = Z_1^2 + \cdots + Z_n^2. \]

The expectation of $|Z|^2$ is $n$, and along with the spherical isoperimetry inequality, we now have a way to describe balls
in the Gauss space: $B$ should be some half-space. (It can have measure not equal to $\frac{1}{2}$ if we don't have the boundary
of that half-space passing through the origin.)


9.4 Projections onto subspaces 


The idea with our next section is that we want to represent a bunch of points in a smaller dimension without distorting 


the distances too much. 


Theorem 9.12 (Johnson-Lindenstrauss Lemma) 


Let $s_1, \dots, s_N$ be points in $\mathbb{R}^n$. Then there exist $s_1', s_2', \dots, s_N' \in \mathbb{R}^m$, where $m = O(\varepsilon^{-2} \log N)$, such that for all $i, j$,

\[ (1 - \varepsilon)|s_i - s_j| \le |s_i' - s_j'| \le (1 + \varepsilon)|s_i - s_j|. \]

So we can approximately preserve distances up to a small multiplicative error.

Proof. Pick a random (orthogonal) projection onto an $m$-dimensional subspace (chosen uniformly at random). This
projection is actually agnostic to the set of points $s_1, \dots, s_N$. We claim that with positive probability, the desired
outcome occurs. If we do this naively, everything gets smaller, so we'll scale by $\sqrt{n/m}$ to correct for that. Basically, our
claim is that all the length ratios are generally preserved.


Lemma 9.13 


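Let $P$ be a projection from $\mathbb{R}^n$ onto a random $m$-dimensional subspace, and let $z \in \mathbb{R}^n$ be some fixed vector. If
we let $z' = Pz$ be a random variable, then $\mathbb{E}[|z'|^2] = \frac{m}{n}|z|^2$, and we have

\[ (1 - \varepsilon)\sqrt{\frac{m}{n}}\, |z| \le |z'| \le (1 + \varepsilon)\sqrt{\frac{m}{n}}\, |z| \]

with probability at least $1 - e^{-c\varepsilon^2 m}$.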


Proof of lemma. Note that fixing our vector and picking a random subspace is equivalent to fixing our projection $P$
and choosing a random unit vector $z$ in $\mathbb{R}^n$. By rotational symmetry, we can make $P$ the projection onto the span of the first $m$ basis
elements $\{e_1, \dots, e_m\}$, and thus any $z = (z_1, \dots, z_n)$ corresponds to $z' = (z_1, \dots, z_m, 0, \dots, 0)$.

Note that $\mathbb{E}[|z'|^2] = \mathbb{E}[z_1^2 + \cdots + z_m^2]$: in general, it's easier to look at squared lengths than lengths. By symmetry,
because all the $z_i^2$s have the same expectation and they sum to 1, each $z_i^2$ has expected value $\frac{1}{n}$ (using linearity), so

\[ \mathbb{E}[|z'|^2] = \mathbb{E}[z_1^2 + \cdots + z_m^2] = \frac{m}{n}. \]

But now note that the projection $z \mapsto |z'|$ is a 1-Lipschitz function, so by Lévy concentration (the isoperimetric inequality
on the sphere), we know that

\[ \Pr\left( \left| |z'| - \sqrt{\frac{m}{n}} \right| > \varepsilon \sqrt{\frac{m}{n}} \right) \le \exp\left[ -cn \cdot \frac{\varepsilon^2 m}{n} \right] = \exp\left[ -c\varepsilon^2 m \right] \]

(remember that mean and median are reasonably close because of Gaussian tails), which is exactly what the lemma
claims.

So now we can finish with a union bound: since everything happens with high probability, we can just say that the
probability some pair $(i, j)$ fails the distance check (applying the lemma to $z = s_i - s_j$) is at most

\[ \binom{N}{2} e^{-c\varepsilon^2 m} < 1 \]

as long as $m$ is chosen to be $O(\varepsilon^{-2} \log N)$ with a large enough constant, and we have the desired result.
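A small experiment shows the distance-preservation phenomenon numerically. Note the substitution: instead of the uniformly random orthogonal projection scaled by $\sqrt{n/m}$ from the proof, this sketch uses a matrix of independent Gaussian entries scaled by $1/\sqrt{m}$, a common variant that is simpler to sample. The dimensions, seed, and function names are assumptions made for illustration.

```python
import math
import random

def gaussian_projection(points, m, rng):
    """Map points in R^n to R^m with a random Gaussian matrix scaled by 1/sqrt(m).

    This replaces the random orthogonal projection in the proof with a
    Gaussian matrix; it is a stand-in, not the construction from the notes.
    """
    n = len(points[0])
    matrix = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]
    scale = 1 / math.sqrt(m)
    return [[scale * sum(row[k] * p[k] for k in range(n)) for row in matrix]
            for p in points]

def distance(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def distortion_range(points, images):
    """Smallest and largest ratio |s_i' - s_j'| / |s_i - s_j| over all pairs."""
    ratios = [distance(images[i], images[j]) / distance(points[i], points[j])
              for i in range(len(points)) for j in range(i)]
    return min(ratios), max(ratios)

rng = random.Random(0)  # fixed seed for reproducibility
n, m, N = 200, 100, 8
points = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(N)]
images = gaussian_projection(points, m, rng)
low, high = distortion_range(points, images)
print("distance ratios lie in", (low, high))
```

The spread of the ratios shrinks like $1/\sqrt{m}$, independent of the ambient dimension $n$, which is the content of the lemma.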


9.5 What if we need stronger concentration? 


Unfortunately, Azuma's inequality is not enough to solve all of our problems. Consider the following: 


Problem 9.14 
Let $V$ be a fixed $d$-dimensional subspace (through the origin), and we pick a point $x \sim \operatorname{Unif}(\{-1, 1\}^n)$. How well
is the Euclidean distance $\operatorname{dist}(x, V)$ concentrated?

We have $n$ independent Boolean variables, so we have a Lipschitz function of $x$, which gives $\sqrt{n}$ concentration of
our random variable $X = \operatorname{dist}(x, V)$ by Azuma's inequality. In particular, the probability that our variable is more than $t\sqrt{n}$ away from its mean
decays like a Gaussian in $t$.

But the diameter of the cube itself is proportional to $\sqrt{n}$, so this is a pretty bad estimate!

Note that we can change the problem a bit, and Azuma does become pretty good. Specifically, if we pick $V$
uniformly at random (then $x$ can be either a fixed point or chosen randomly - it doesn't matter), Azuma gives
pretty good concentration. Alternatively, we could pick $x$ uniformly at random from the sphere that goes through the
vertices of the Boolean cube as well, and Azuma still yields reasonable concentration. These are the same by rotational
symmetry, and in the second case, we're asking for the concentration of a Lipschitz function on a sphere. We know
then that if $f$ is 1-Lipschitz on a $\sqrt{n}$-radius sphere in $\mathbb{R}^n$, then

\[ \Pr(|f - \mathbb{E}[f]| \ge t) \le Ce^{-ct^2}. \]


So if we can get O(1) concentration on the sphere, intuitively we should also be able to get it on the Boolean cube 


as well. We just haven't been able to do this with the methods introduced so far. 


9.6 Talagrand’s inequality: special case 


As often happens with Euclidean distances, it's hard to calculate the mean of $X$, but analyzing $X^2$ is much easier. Let
$P = (p_{ij})$ be the projection operation onto the orthogonal complement of $V$, our $d$-dimensional subspace: then $P$ is some
matrix in $\mathbb{R}^{n \times n}$, and we have

\[ X^2 = \langle x, Px \rangle = \sum_{i,j} x_i x_j p_{ij}. \]

Since the coordinates $x_i$ are independent with mean 0 and $x_i^2 = 1$ (so $\mathbb{E}[x_i x_j]$ is 1 if $i = j$ and 0 otherwise), this just leads us to

\[ \mathbb{E}[X^2] = \sum_i p_{ii} = \operatorname{tr} P = n - d. \]

Notably, this expectation of $X^2$ does not depend on the orientation of $V$, though the distribution of $X^2$ does. So we
should expect $X$ to be concentrated about $\sqrt{n - d}$, and that gives us a center to work with. We're trying to claim that
we have $O(1)$-concentration; specifically, we'd like to show that there is exponential decay with a constant deviation.
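The identity $\mathbb{E}[X^2] = n - d$ can be verified exactly in the simplest case $d = 1$, where $V$ is the line spanned by a vector $v$ and $\operatorname{dist}(x, V)^2 = |x|^2 - \langle x, v \rangle^2 / |v|^2$. The sketch below enumerates all sign vectors with exact rational arithmetic; the specific vectors $v$ and the function names are arbitrary choices for illustration.

```python
from fractions import Fraction
from itertools import product

def squared_distance_to_line(x, v):
    """Squared Euclidean distance from x to the line spanned by v (the case d = 1)."""
    dot = sum(a * b for a, b in zip(x, v))
    return sum(a * a for a in x) - Fraction(dot * dot, sum(b * b for b in v))

def mean_squared_distance(v):
    """Exact E[dist(x, V)^2] for x uniform on {-1, 1}^n, by full enumeration."""
    n = len(v)
    total = sum(squared_distance_to_line(x, v) for x in product((-1, 1), repeat=n))
    return total / 2 ** n

# The answer should be n - d = n - 1 regardless of the direction of v.
print(mean_squared_distance((3, 1, 4, 1, 5, 9)))  # n = 6
print(mean_squared_distance((1, 1, 1, 1)))        # n = 4
```

The mean is the same for every direction $v$, even though the distribution of the distance is not: for $v = (1, 1, 1, 1)$ the vertex $(1, 1, 1, 1)$ lies on the line itself.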


To do this, we finally introduce the inequality we want: 
Theorem 9.15 (Talagrand's inequality, simplified)

Let $A \subseteq \mathbb{R}^n$ be a convex subset, and let $x$ be a uniform random point in the Boolean cube

\[ x \sim \operatorname{Unif}(\{0, 1\}^n). \]

Then for all $t \ge 0$,

\[ \Pr(x \in A) \Pr(\operatorname{dist}(x, A) \ge t) \le e^{-t^2/4}, \]

where we use the Euclidean distance.

Convexity here is extremely important. Talagrand is just not true otherwise - for example, consider $A$ to be just
the set of points in $\{0, 1\}^n$ with weight (sum of entries) at least $\frac{n}{2}$. Then a random vertex is generally on the order of $\sqrt{n}$ away in Hamming distance:
specifically, with probability bounded below by a positive constant, the weight of $x$ is at most $\frac{n}{2} - c\sqrt{n}$ for some $c > 0$.

Then the Euclidean distance is the square root of the Hamming distance on the Boolean cube, so the distance
from $x$ to $A$ is on the order of $n^{1/4}$, which is not constant.

So what's really going on with this inequality? Given a convex set - for example, the convex hull of those same
points in our Boolean cube - we're now measuring the distance to possibly some convex average of our vertices, and
that distance is generally much smaller than if we were only allowed to use the vertices themselves.


Definition 9.16
Define a function f to be quasi-convex if all sets $\{f \le a\}$ for $a \in \mathbb{R}$ are convex. (All convex functions are quasi-convex as well.)


Corollary 9.17
Suppose we have a function $f : \mathbb{R}^n \to \mathbb{R}$ that is quasi-convex and 1-Lipschitz with respect to the Euclidean distance. Then for all $r \in \mathbb{R}$ and $t > 0$, for x picked uniformly from the cube $\{0,1\}^n$,

$$\Pr(f(x) \le r) \Pr(f(x) \ge r + t) \le e^{-t^2/4}.$$

This is a direct translation of the isoperimetric inequality. The theorem implies the corollary by letting A be the set $\{x : f(x) \le r\}$, which is convex by the definition of quasi-convexity. Since f is 1-Lipschitz, we have

$$f(x) < r + t \quad \text{for all } x \text{ with } \operatorname{dist}(x, A) < t,$$

so the event $f(x) \ge r + t$ implies $\operatorname{dist}(x, A) \ge t$.


With this, we're now ready to answer our initial problem: 
Theorem 9.18 


Let V be a fixed d-dimensional subspace, and let $f(x) = \operatorname{dist}(x, V)$. If we pick x uniformly on the cube $\{0,1\}^n$, then there exist constants $C, c > 0$ such that for all $t > 0$,

$$\Pr(|f - \mathbb{E}[f]| \ge t) \le C e^{-ct^2}.$$

Proof sketch. Let m be a median of f. Using Corollary 9.17, set $r = m$ (so that $\Pr(f \le m) \ge 1/2$) to get the upper tail

$$\Pr(f \ge m + t) \le 2e^{-t^2/4}.$$

Meanwhile, set $r = m - t$ (and use $\Pr(f \ge m) \ge 1/2$) to get the lower tail

$$\Pr(f \le m - t) \le 2e^{-t^2/4}.$$


We also mentioned that the median and the mean are very close for sub-Gaussian distributions. With some calculations, we can show that the median of X satisfies

$$\operatorname{med}(X) = \sqrt{n - d} + O(1)$$

(with an absolute constant), or else we get an inconsistency with the tail bounds. Since $\mathbb{E}[f^2] = n - d$ and we have constant concentration, $\mathbb{E}[f]$ is also $\sqrt{n - d} + O(1)$, and thus we have constant deviation from the mean, as desired.


So the whole point is that Talagrand is about concentration of convex Lipschitz functions when evaluating 
at a random point of the Boolean cube. We're not going to prove the inequality in class, because there are some 


tedious calculations involved. Instead, let’s focus on combinatorial applications. 


9.7 Random matrices 


Let A be a random symmetric $n \times n$ matrix with independent $\pm 1$ entries $a_{ij}$ for $i \le j$, where $a_{ji} = a_{ij}$. This can be thought of as being related to the adjacency matrix of a random graph.


It turns out the largest eigenvalue $\lambda_1$ is also the operator norm of A:
$$\lambda_1(A) = \|A\|_{op}.$$

How well is this concentrated? We have about $O(n^2)$ variables, so Azuma's inequality gives something like $O(n)$ concentration about the mean. But this is pretty bad, because typically the largest eigenvalue satisfies
$$\lambda_1(A) \lesssim \sqrt{n},$$
so linear concentration doesn't really help at all. On the other hand, let's try to use Talagrand's inequality. We need to check a few things: consider the function $f : A \mapsto \|A\|_{op}$.
- Convexity comes from the fact that the operator norm is a norm, so we can use the triangle inequality.
- To show this function is 1-Lipschitz, we need

$$|f(x) - f(y)| \le \|x - y\|_2,$$

  where we're using the $L^2$ norm on the vector of entries. This can be proved using Cauchy-Schwarz.

So now Talagrand's inequality tells us that we have constant-window concentration, independent of n. In other words, we've just shown that
$$\Pr(|\lambda_1(A) - \mathbb{E}[\lambda_1(A)]| \ge t) \le C e^{-ct^2}$$


for some C, c, which decays like a Gaussian. 


Fact 9.19 


We actually know more about the concentration: the fluctuations are actually $\Theta(n^{-1/6})$, and the distribution converges to something called a Tracy-Widom distribution when normalized. Also, we know the mean of this distribution: the easiest way to compute it is to make the entries Gaussian instead of $\pm 1$, but the answer is approximately $2\sqrt{n}$ regardless of the distribution.

As a sidenote, we can't actually use this method to prove the concentration of the second largest eigenvalue yet,
since that's not convex as a function of our matrix entries. But the bottom line is that Talagrand’s inequality is not 


just about the Boolean cube. 


9.8 Talagrand’s inequality in general 


If we have a space $\Omega = \Omega_1 \times \cdots \times \Omega_n$, we may want to find the distance between two points.

Definition 9.20

Given a vector $\alpha = (\alpha_1, \ldots, \alpha_n) \in \mathbb{R}^n_{\ge 0}$, define the weighted Hamming distance

$$d_\alpha(x, y) = \sum_{i : x_i \ne y_i} \alpha_i.$$

This kind of distance is defined even if the individual $\Omega_i$s don't have metrics! So now if we have some subset $A \subseteq \Omega$ of our product space, we can define
$$d_\alpha(x, A) = \inf_{y \in A} d_\alpha(x, y).$$
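As a small illustration of these two definitions, here is a minimal Python sketch. The function names `weighted_hamming` and `dist_to_set` are made up for this example, points are represented as plain sequences whose coordinates can be arbitrary objects, and the set A is assumed to be finite so that the infimum is a minimum.

```python
from typing import Iterable, Sequence


def weighted_hamming(alpha: Sequence[float], x: Sequence, y: Sequence) -> float:
    """d_alpha(x, y): total weight of the coordinates where x and y differ."""
    return sum(a for a, xi, yi in zip(alpha, x, y) if xi != yi)


def dist_to_set(alpha: Sequence[float], x: Sequence, A: Iterable[Sequence]) -> float:
    """d_alpha(x, A) for a finite, nonempty set A: the minimum over y in A."""
    return min(weighted_hamming(alpha, x, y) for y in A)


if __name__ == "__main__":
    # Coordinates need no metric of their own: here they are characters.
    print(weighted_hamming([0.5, 0.25, 0.25], "abc", "abd"))
    print(dist_to_set([1, 2, 3], (0, 0, 0), [(1, 1, 0), (0, 0, 1)]))
```

Only equality tests between coordinates are used, which matches the remark that the individual factors do not need metrics.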


This should still feel fairly familiar: for example, for $\Omega = \{0,1\}^n$, if we have a fixed $\alpha$ with $|\alpha| = 1$ (under the $L^2$ norm), we have

$$d_\alpha(x, y) = \sum_i \alpha_i |x_i - y_i| = \langle \alpha, |x - y| \rangle,$$

where $|x - y|$ is taken coordinatewise.

Azuma's inequality then tells us that if we choose x uniformly on $\{0,1\}^n$, then for a fixed $\alpha$ and subset $A \subseteq \{0,1\}^n$,

$$\Pr(|d_\alpha(x, A) - \mathbb{E}[d_\alpha(x, A)]| \ge t) \le 2e^{-t^2/2}.$$

(This is because the weights in Azuma's inequality satisfy $\sum_i \alpha_i^2 = 1$ from the definition of $\alpha$.) But Azuma only gives this to us for a fixed $\alpha$: having this condition be true in all directions is much stronger, and that's what Talagrand's inequality tells us.


Definition 9.21 


Define the convex distance

$$d_T(x, A) = \sup_{\alpha \in \mathbb{R}^n_{\ge 0},\ |\alpha| = 1} d_\alpha(x, A).$$

(Basically, choose the “worst” possible $\alpha$: the one that separates x and A by the most.) This is easier to visualize if we think of $\Omega = \{0,1\}^n$ and $A \subseteq \Omega$ being a subset of that Boolean cube: $d_T(x, A)$ is then just the Euclidean distance from x to the convex hull of A. (In general, we do not need each coordinate to be limited to $\{0,1\}$, though: they can take on any set of values.)


Theorem 9.22 (Talagrand’s inequality, general) 


For any $A \subseteq \Omega = \Omega_1 \times \cdots \times \Omega_n$, let x be a random point in $\Omega$ with independent coordinates. Then

$$\Pr(x \in A) \Pr(d_T(x, A) \ge t) \le e^{-t^2/4}.$$

Here's another interpretation of the convex distance: we want to convert this to a “distance to convex hull” type argument even when we don't have a Boolean cube.

Definition 9.23
Define $U_A(x)$ to be the set of $s \in \{0,1\}^n$ such that there is some $y \in A$ so that $s_i = 1$ for all i with $x_i \ne y_i$. In other words, $s \in U_A(x)$ if the support of s contains the set of coordinates where x and y differ, for some $y \in A$.


We can think of this as the “set of coordinates we need to change to get from x to A,” ignoring the actual coordinates: notice that this is a subset of the Boolean cube even when A isn't. Notably, this is an increasing subset of the cube $\{0,1\}^n$.


Lemma 9.24 
Letting dist be the Euclidean distance,

$$d_T(x, A) = \operatorname{dist}(0, \text{convex hull}(U_A(x))).$$

Proof. The left hand side is (by definition) the supremum over all weight vectors $\alpha$ of norm 1 of the $\alpha$-distance between x and A, which is equivalent to looking at the closest point in A under this $\alpha$: this is then

$$\sup_{|\alpha| = 1} \inf_{y \in A} d_\alpha(x, y) = \sup_{|\alpha| = 1} \inf_{s \in U_A(x)} (\alpha \cdot s).$$

By von Neumann's minimax theorem, we can swap the inf and sup as long as we extend to the convex hull:

$$= \inf_{s \in \text{convex hull}(U_A(x))} \sup_{|\alpha| = 1} (\alpha \cdot s) = \inf_{s \in \text{convex hull}(U_A(x))} |s|,$$

as desired.


So how do we apply Talagrand? The idea is that we can adjust our $\alpha$ to favor certain coordinates and give us better bounds. $\alpha$ plays the role of a “certificate” that guarantees the existence of a small or large value. In particular, it follows from Talagrand that (rearranging)

$$\Pr(d_{\alpha(x)}(x, A) \ge t) \le \frac{e^{-t^2/4}}{\Pr(A)}$$

for any choice of unit vectors $\alpha(x)$. Specifically, we can pick a different certificate $\alpha$ for each x:


Corollary 9.25 
Let A, B be subsets of $\Omega = \Omega_1 \times \Omega_2 \times \cdots \times \Omega_n$. Suppose that for all $y \in B$, there exists an $\alpha = \alpha(y) \in \mathbb{R}^n_{\ge 0}$ so that for all $x \in A$,

$$d_\alpha(x, y) \ge t|\alpha|.$$

(This means the distance between A and B is large in the specific Talagrand sense.) Then

$$\Pr(A) \Pr(B) \le e^{-t^2/4}.$$


To understand this, let's do another proof of the largest eigenvalue of our random matrix: 


Proof. Let X be an $n \times n$ symmetric random matrix with independent entries in the interval $[-1, 1]$ (they can be distributed in any way, as long as they are independent and bounded). Fix $t > 0$ and $M \in \mathbb{R}$, and define the sets

$$A = \{X : \lambda_1(X) \le M\}, \quad B = \{X : \lambda_1(X) \ge M + t\}.$$

We want to verify that for every matrix $Y \in B$ with large eigenvalue, we can certify this somehow: we pick some $\alpha(Y)$ such that Y is far away from A. Specifically, we claim there exists some $\alpha \in \mathbb{R}^m_{\ge 0}$, where $m = \binom{n+1}{2}$ is the number of independent entries, such that $d_\alpha(X, Y) \ge ct|\alpha|$ for all $X \in A$.

Let $v \in \mathbb{R}^n$ be the unit top eigenvector corresponding to $\lambda_1(Y)$. Then let

$$\alpha_{ij} = \begin{cases} v_i^2 & i = j, \\ 2|v_i||v_j| & i < j; \end{cases}$$

the reason for doing this will become quickly apparent. By the Courant-Fischer characterization of the top eigenvalue,

$$v^T Y v = \lambda_1(Y) \ge M + t,$$

and because X does not have a large eigenvalue, we get a contrasting bound for X:

$$v^T X v \le \lambda_1(X) \le M.$$

In particular, this means that we can use our eigenvector v to “separate” A and B:

$$t \le v^T (Y - X) v,$$

and expanding out the difference as a bilinear form,

$$= \sum_{i,j} v_i v_j (y_{ij} - x_{ij}).$$

This is upper bounded by looking at only those entries where the two matrices differ:

$$\le 2 \sum_{i,j} |v_i||v_j| 1_{x_{ij} \ne y_{ij}} = 2 d_\alpha(X, Y).$$

(Here, we used that the entries of X and Y lie in $[-1, 1]$, so $|y_{ij} - x_{ij}| \le 2$.) Now note that the length of $\alpha$ is at most 2 (indeed, $|\alpha|^2 \le 2(\sum_i v_i^2)^2 = 2$), so $d_\alpha(X, Y) \ge t/2 \ge (t/4)|\alpha|$, and plug this into the corollary to get concentration.


9.9 Increasing subsequences 


Problem 9.26 


Pick a uniformly random permutation $\sigma \in S_n$ of the first n integers. How long is the longest increasing subsequence?

Call this length X. It's important to note that we can skip entries (so subsequences don't need to be contiguous). For example, 5, 3, 1, 4, 6, 2, 7 has a longest increasing subsequence of length 4 (such as 3, 4, 6, 7).
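To experiment with this quantity, the length of the longest increasing subsequence can be computed in O(n log n) time with the standard patience-sorting idea. The sketch below is only illustrative; the name `lis_length` is our own and not part of the notes.

```python
from bisect import bisect_left
from typing import Sequence


def lis_length(seq: Sequence[float]) -> int:
    """Length of the longest strictly increasing (not necessarily contiguous) subsequence."""
    # tails[k] is the smallest possible last element of an increasing
    # subsequence of length k + 1 seen so far.
    tails = []
    for value in seq:
        pos = bisect_left(tails, value)
        if pos == len(tails):
            tails.append(value)   # extends the longest subsequence found so far
        else:
            tails[pos] = value    # found a smaller tail for length pos + 1
    return len(tails)


if __name__ == "__main__":
    print(lis_length([5, 3, 1, 4, 6, 2, 7]))  # the example from the text
```

On the example permutation from the text, the function returns 4.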

Our goal is to show that X is concentrated. Let's try to use the tools we have: first of all, let's try Azuma. We need independence of our underlying variables, so let's try to make our $\Omega_i$s independent: let $x_1, \ldots, x_n \sim \text{Unif}[0, 1]$ independently, and get a permutation of [n] from the relative orderings of the $x_i$s.
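The reduction to independent coordinates can be sketched as follows. This is a hypothetical helper setup (the names `ranks`, `sample`, and `lis_length` are ours): it draws independent uniform values, reads off the permutation given by their relative order, and relies on the fact that the longest increasing subsequence is the same for the raw values and for their ranks.

```python
import random
from bisect import bisect_left


def lis_length(seq):
    """Length of the longest strictly increasing subsequence (patience sorting)."""
    tails = []
    for value in seq:
        pos = bisect_left(tails, value)
        if pos == len(tails):
            tails.append(value)
        else:
            tails[pos] = value
    return len(tails)


def ranks(xs):
    """Permutation of 1..n recording the relative order of xs (entries assumed distinct)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    perm = [0] * len(xs)
    for rank, index in enumerate(order, start=1):
        perm[index] = rank
    return perm


def sample(n, rng):
    """Draw n independent Unif[0, 1] values and the permutation they induce."""
    xs = [rng.random() for _ in range(n)]
    return xs, ranks(xs)


if __name__ == "__main__":
    xs, perm = sample(10, random.Random(0))
    print(perm, lis_length(perm), lis_length(xs))
```

Working with the `xs` directly is convenient because each coordinate can then be resampled independently of the others.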

Then the length of the longest increasing subsequence changes by at most 1 if we change 1 coordinate, so it is 1-Lipschitz here. Azuma tells us that we have sub-Gaussian decay with a window size of $O(\sqrt{n})$.

How good is this? Let's do a first moment calculation to see the typical size of X:

$$\Pr(X \ge k) \le \binom{n}{k} \frac{1}{k!} \le \left(\frac{e^2 n}{k^2}\right)^k,$$

since we can pick any of the $\binom{n}{k}$ sets of k positions, and each has probability $\frac{1}{k!}$ of being increasing. (The last step uses $\binom{n}{k} \le (en/k)^k$ and $k! \ge (k/e)^k$.)

So if $k = 100\sqrt{n}$, then this probability is o(1), and thus we should expect the longest increasing subsequence to be no more than $c\sqrt{n}$ long. That means our concentration bound is bad! (In particular, any permutation of length n has either an increasing or a decreasing subsequence of length $\sqrt{n}$ by Pigeonhole.)

Let's see if Talagrand tells us anything better. The idea is that Talagrand is useful when we can “witness” rare events: showing such a sequence exists (that is, making the length certifiable) doesn't use that many of the coordinates of x. So Talagrand will actually tell us that we have fluctuations on the order of $O(n^{1/4})$.


Here's that idea in more rigor: 


Theorem 9.27 
Let $\Omega = \Omega_1 \times \cdots \times \Omega_n$, and let $f : \Omega \to \mathbb{R}$ be a 1-Lipschitz function with respect to the Hamming distance. Suppose that we can verify

$$\{\omega : f(\omega) \ge r\}$$

by checking at most s coordinates. Then for every $t \ge 0$, for a random $\omega \in \Omega$ with independent coordinates,

$$\Pr(f(\omega) \le r - t\sqrt{s}) \Pr(f(\omega) \ge r) \le e^{-t^2/4}.$$

When we say checking at most s coordinates here, we specifically mean that for any $\omega$ with $f(\omega) \ge r$, there exists some subset $I \subseteq [n]$ with $|I| \le s$ such that every $\omega'$ that agrees with $\omega$ on I also satisfies $f(\omega') \ge r$. In other words, knowing those s coordinates guarantees that our condition is true.


Proof. Let A, B be the sets 
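$$A = \{\omega : f(\omega) \le r - t\sqrt{s}\}, \quad B = \{\omega : f(\omega) \ge r\}.$$

Our goal is to check that for all $y \in B$, there exists $\alpha \in \mathbb{R}^n_{\ge 0}$ such that for all $x \in A$, $d_\alpha(x, y) \ge t|\alpha|$. (Basically, A is far away from B, even if we zoom in on I.)

But by definition of checking coordinates, for each $y \in B$ there exists a set $I \subseteq [n]$ with $|I| \le s$ certifying that $f(y) \ge r$. Here's the key: if we fix y and let $\alpha = 1_I$ (1 in the spots of I and 0 in the others), every $x \in A$ disagrees with y on at least $t\sqrt{s}$ coordinates of I. (Otherwise, we could change fewer than $t\sqrt{s}$ coordinates of x to get an $x'$ that agrees with y on I, meaning $f(x') \ge r$; since f is 1-Lipschitz, this would force $f(x) > r - t\sqrt{s}$, contradicting $x \in A$.) This means that the weighted Hamming distance satisfies $d_\alpha(x, y) \ge t\sqrt{s} \ge t|\alpha|$, because $|\alpha| = \sqrt{|I|} \le \sqrt{s}$, and now we can apply Talagrand's inequality (Corollary 9.25) directly.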


Corollary 9.28 
Given a 1-Lipschitz function $f : \Omega = \Omega_1 \times \cdots \times \Omega_n \to \mathbb{R}$ (with respect to the Hamming distance), if $\{f \ge r\}$ can be verified by checking r coordinates for every r, and m is a median of $X = f(\omega)$, then for all $t \ge 0$,

$$\Pr(X \le m - t) \le 2\exp\left(-\frac{t^2}{4m}\right),$$

$$\Pr(X \ge m + t) \le 2\exp\left(-\frac{t^2}{4(m + t)}\right).$$

Proof. From the above theorem with $s = r$, renormalize t by a factor of $\sqrt{r}$: now we have

$$\Pr(f \le r - t) \Pr(f \ge r) \le e^{-t^2/(4r)}.$$

Setting $r = m$ gives the lower tail bound, and setting $r = m + t$ gives the upper tail bound.


So now this applies directly to X for our increasing subsequence, since we can “witness” our event by just showing 


the subsequence itself. Note also that the median of X is $O(\sqrt{n})$, so now we know that

$$\Pr(|X - \mathbb{E}[X]| \le s) = 1 - o(1) \quad \text{if } s \gg n^{1/4},$$

meaning we've found $n^{1/4}$-concentration.
But this is not the best possible result! In 1985, Vershik-Kerov showed that X is concentrated around $2\sqrt{n}$, and in fact, the limiting distribution was found by Baik-Deift-Johansson in 1999 to be

$$\frac{X - 2\sqrt{n}}{n^{1/6}} \to \text{Tracy-Widom distribution}.$$

(As we may remember, this is also the fluctuation of the top eigenvalue of a random matrix.)

10 Entropy methods


10.1 Information entropy 


We're going to shift away from concentration results now. This next concept was essentially invented by Shannon, 


and we'll focus on its combinatorial applications. 


Definition 10.1 


Let X be a discrete random variable taking values in some set S. Then the entropy of X is

$$H(X) = \sum_{s \in S} -p_s \log_2 p_s,$$

where $p_s = \Pr(X = s)$.


Intuitively, entropy is supposed to measure the amount of randomness or information in the random variable X. 
Because we're doing combinatorics, we'll work with base-2 logarithms - this is really more of a convention than 


anything else, and all logs in this section mean base 2. 


Example 10.2 
The entropy of a Bernoulli variable Ber(p) is just $-p\log_2 p - (1 - p)\log_2(1 - p)$, which has a maximum of 1 at $p = \frac{1}{2}$.


Basically, this tracks how “surprised” we are when we hear a sample from the distribution. This idea essentially 
comes from trying to encode messages efficiently: for example, if a coin only comes up heads 1% of the time, encoding 


it as a binary string directly is not the most efficient way. 
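To make the definition concrete, here is a minimal sketch of the entropy computation in base 2. The helper names `entropy` and `binary_entropy` are ours, and terms with probability 0 are skipped, following the usual convention that $0 \log_2 0 = 0$.

```python
import math
from typing import Iterable


def entropy(probs: Iterable[float]) -> float:
    """Shannon entropy in bits of a distribution given as a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli(p) variable."""
    return entropy([p, 1 - p])


if __name__ == "__main__":
    print(binary_entropy(0.5))   # a fair coin carries one full bit
    print(binary_entropy(0.01))  # a heavily biased coin carries far less
```

For the 1% coin from the text, the entropy comes out to roughly 0.08 bits per flip, which quantifies how wasteful it is to spend one bit per flip on it.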


Lemma 10.3 


$$H(X) \le \log_2 |\text{range}(X)|.$$

Proof. This follows from the convexity of the function $x \mapsto x\log_2 x$ (via Jensen's inequality).

Equality holds when we have the uniform distribution: then H(X) tells us the number of binary bits needed to specify which choice of X we pick out.
Denote by H(X, Y) the entropy of the joint random variable Z = (X, Y), where X and Y are not necessarily independent. This means we have

$$H(X, Y) = \sum_{(x,y)} -\Pr(X = x, Y = y) \log_2 \Pr(X = x, Y = y).$$
(x,y) 


Lemma 10.4 (Subadditivity) 


Given any two random variables X, Y, $H(X, Y) \le H(X) + H(Y)$.

Proof. Write $p(x, y) = \Pr(X = x, Y = y)$, with marginals $p(x)$ and $p(y)$. Expanding $H(X) + H(Y) - H(X, Y)$ out, this gives

$$H(X) + H(Y) - H(X, Y) = \sum_{x,y} \left(-p(x, y)\log_2 p(x) - p(x, y)\log_2 p(y) + p(x, y)\log_2 p(x, y)\right) = \sum_{x,y} p(x, y)\log_2 \frac{p(x, y)}{p(x)p(y)}.$$

Let $f(t) = t\log_2 t$, which is a convex function. Then by Jensen's inequality, we can bound this as

$$= \sum_{x,y} p(x)p(y) f\left(\frac{p(x, y)}{p(x)p(y)}\right) \ge f\left(\sum_{x,y} p(x)p(y) \cdot \frac{p(x, y)}{p(x)p(y)}\right) = f(1) = 0.$$

Basically, there's at least as much information in X and Y individually as when we put them together. The quantity
$$H(X) + H(Y) - H(X, Y) = I(X; Y)$$
is called the mutual information, and it's always nonnegative.
In particular, if X and Y are independent, then H(X, Y) = H(X) + H(Y). In this case, the amount of information in the pair (X, Y) is just the sum of the individual parts.


Corollary 10.5 


For any random variables $X_1, \ldots, X_n$,

$$H(X_1, \ldots, X_n) \le H(X_1) + \cdots + H(X_n).$$

There's also a notion of “conditional entropy”: let E be an event with positive probability, and then we have
$$H(X|E) = \sum_x -\Pr(X = x|E) \log_2 \Pr(X = x|E).$$

What's really important to us, though, is when we condition on a second random variable: if X and Y are jointly distributed, we define

$$H(X|Y) = \mathbb{E}_y[H(X|Y = y)].$$


Essentially, this is how much new information we get given a certain piece of information about Y. 


Lemma 10.6 (Chain rule) 


For any random variables X, Y, $H(X|Y) = H(X, Y) - H(Y)$.

Proof.

$$\begin{aligned}
H(X|Y) &= \mathbb{E}_y[H(X|Y = y)] \\
&= \sum_y \Pr(Y = y) H(X|Y = y) = \sum_y p(y) \sum_x -p(x|y)\log_2 p(x|y) \\
&= \sum_{x,y} -p(x, y)\log_2 p(x, y) + \sum_{x,y} p(x, y)\log_2 p(y) \\
&= \sum_{x,y} -p(x, y)\log_2 p(x, y) + \sum_y p(y)\log_2 p(y) = H(X, Y) - H(Y),
\end{aligned}$$

where the third line follows from Bayes' rule $p(x|y) = p(x, y)/p(y)$ and the last because $\sum_x p(x, y) = p(y)$.

In other words, the conditional entropy is just the total entropy minus what we “already knew about Y.” In particular, if X = Y, or if X = f(Y) (so we know X given Y), the conditional entropy is 0. On the other hand, if X and Y are independent, the conditional entropy is just H(X).

Lemma 10.7 (Dropping conditioning)

For any random variables X, Y, Z, $H(X|Y) \le H(X)$ and $H(X|Y, Z) \le H(X|Z)$.

Proof. These follow from the chain rule (Lemma 10.6) and subadditivity (Lemma 10.4). For example,

$$H(X|Y) = H(X, Y) - H(Y) \le H(X).$$
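The chain rule, subadditivity, and dropping conditioning can all be checked numerically on a small joint distribution. The following is a sketch under our own conventions: a joint distribution is a dictionary mapping pairs (x, y) to probabilities, and the helper names are hypothetical.

```python
import math
from collections import defaultdict


def entropy(dist):
    """Entropy in bits of a distribution given as {outcome: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def marginal(joint, axis):
    """Marginal of a joint distribution {(x, y): p}; axis 0 gives X, axis 1 gives Y."""
    out = defaultdict(float)
    for pair, p in joint.items():
        out[pair[axis]] += p
    return dict(out)


def conditional_entropy(joint):
    """H(X | Y), computed from the definition as an average over y of H(X | Y = y)."""
    total = 0.0
    for y, p_y in marginal(joint, 1).items():
        cond = {x: p / p_y for (x, yy), p in joint.items() if yy == y}
        total += p_y * entropy(cond)
    return total


if __name__ == "__main__":
    joint = {(0, 0): 0.5, (0, 1): 0.25, (1, 1): 0.25}
    h_xy = entropy(joint)
    h_x = entropy(marginal(joint, 0))
    h_y = entropy(marginal(joint, 1))
    print(conditional_entropy(joint), h_xy - h_y)  # chain rule: these agree
    print(h_xy, h_x + h_y)                         # subadditivity
```

In this example H(X, Y) = 1.5 and H(Y) = 1, so the chain rule gives H(X|Y) = 0.5, which is also what the direct computation returns.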


10.2 Various direct applications 


Let’s start to see how this can be useful! Entropy’s use primarily comes up in tail bounds. Here’s a philosophy: we 
want to show an upper bound on some quantity, so we start by taking the log of both sides. The left side is the log 


of some quantity, so take a uniform probability distribution on the things we want to count: we now have an entropy. 


Theorem 10.8 


Let $\mathcal{F}$ be a collection of subsets of [n], and let $p_i$ be the fraction of subsets in $\mathcal{F}$ that contain the element i. Then

$$\log_2 |\mathcal{F}| \le \sum_{i=1}^n H(p_i),$$

where H(p) is the binary entropy of the Bernoulli variable

$$H(p) = -p\log_2 p - (1 - p)\log_2(1 - p).$$

Proof. Let $X = (X_1, \ldots, X_n)$ be the characteristic vector for a uniform random element $F \in \mathcal{F}$: this means that $X_i$ is 1 if $i \in F$ and 0 otherwise. The entries aren't necessarily independent here, but entropy lets us handle this: $\log_2 |\mathcal{F}|$ is just H(X), because we have a uniform distribution (this is the equality case of Lemma 10.3).

By subadditivity, this is at most $H(X_1) + \cdots + H(X_n)$. Each $X_i$ is a Bernoulli random variable with probability $p_i$, so $H(X_i) = H(p_i)$, which is what we want.
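As a sanity check of this bound, the sketch below computes both sides for an explicit family of subsets of [n]. The function names are ours, and the example families are arbitrary choices made for illustration.

```python
import math
from itertools import combinations


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) variable, with 0 * log 0 treated as 0."""
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)


def family_bound(family, n):
    """Return (log2 |F|, sum of H(p_i)) for a family F of subsets of {1, ..., n}."""
    family = [frozenset(s) for s in family]
    size = len(family)
    total = 0.0
    for i in range(1, n + 1):
        p_i = sum(1 for member in family if i in member) / size
        total += binary_entropy(p_i)
    return math.log2(size), total


if __name__ == "__main__":
    lhs, rhs = family_bound([{1}, {2}, {3}, {1, 2}, {1, 2, 3}], 3)
    print(lhs, rhs)
    # The full power set is the equality case: every p_i equals 1/2.
    power_set = [set(c) for k in range(4) for c in combinations([1, 2, 3], k)]
    print(family_bound(power_set, 3))
```

For the full power set both sides equal n, matching the equality case of subadditivity for independent coordinates.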


Theorem 10.9 
If $k \le \frac{n}{2}$, then
$$\sum_{i=0}^{k} \binom{n}{i} \le 2^{H(k/n)\, n}.$$

Proof. Let $X = (X_1, \ldots, X_n) \in \{0,1\}^n$ be the uniform random vector conditioned on $X_1 + \cdots + X_n \le k$. By Lemma 10.3, the logarithm of the left hand side is H(X), and by subadditivity, this is at most $H(X_1) + \cdots + H(X_n)$.

Conditioning on the sum of the $X_i$ being exactly m, each $X_i$ is a Bernoulli variable with probability $\frac{m}{n}$. Since the sum is always at most k, we can say that $X_i$ is Bernoulli with probability at most $\frac{k}{n}$. Since this is at most $\frac{1}{2}$ by assumption and the entropy of a Bernoulli increases until $p = 1/2$, we have that $H(X_i) \le H\left(\frac{k}{n}\right)$.

Now there are n copies of this term, and rearranging gives the result.
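The proof above shows that for $k \le n/2$, the number of subsets of [n] of size at most k is at most $2^{nH(k/n)}$. The following sketch compares the two quantities for small parameters; the function names are ours, and `math.comb` requires Python 3.8 or later.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) variable, with 0 * log 0 treated as 0."""
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)


def binomial_tail(n, k):
    """Number of subsets of an n-element set with at most k elements."""
    return sum(math.comb(n, i) for i in range(k + 1))


def entropy_bound(n, k):
    """The upper bound 2^(n * H(k/n)), intended for 0 <= k <= n/2."""
    return 2.0 ** (n * binary_entropy(k / n))


if __name__ == "__main__":
    for n, k in [(10, 2), (20, 5), (30, 15)]:
        print(n, k, binomial_tail(n, k), entropy_bound(n, k))
```

The bound is loosest near $k = n/2$, where it degenerates to the trivial estimate $2^n$.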


We get a similar result if we don't pick everything with probability $\frac{1}{2}$ but instead with probability p: then we get a relative entropy called the Kullback-Leibler divergence.

Theorem 10.10 (This was problem 32 from our problem set)
Let $S_1, \ldots, S_k$ be subsets of [n], and suppose that for every pair of distinct subsets $A, B \subseteq [n]$, there exists an i such that

$$|S_i \cap A| \ne |S_i \cap B|.$$

Then $k \ge (2 - o(1)) \frac{n}{\log_2 n}$.


This is called a coin weighing problem, because we can imagine that we have two types of coins, where one is a little heavier than the other. We can then weigh k times (weighing the coins in $S_i$ on the i-th weighing), and we want to be able to tell which coins are counterfeit. Well, if there always exists an i that distinguishes any two candidate sets, then we know exactly which coins we want. It turns out we need at least $\approx \frac{2n}{\log_2 n}$ weighings to do the job.

The main idea here is that there's some information that we're gaining on each comparison $S_i$: can we get enough to deduce the set of coins?


Proof. Let X be a uniform random subset of [n]. Since there are $2^n$ different possibilities that are uniformly weighted, the entropy of X is just n. Observe that X contains the same information as the sizes $|X \cap S_i|$ for $1 \le i \le k$: in particular, the map from X to these sizes is injective, since no two subsets have the same list of intersection sizes. By subadditivity,
$$H(X) = H(|X \cap S_1|, \ldots, |X \cap S_k|) \le H(|X \cap S_1|) + \cdots + H(|X \cap S_k|).$$

Because X is a uniform subset of 1 through n, $|X \cap S_i|$ is binomial with distribution $\text{Bin}\left(|S_i|, \frac{1}{2}\right)$. This variable takes at most $|S_i| + 1 \le n + 1$ values, so its entropy is at most $\log_2(n + 1)$, and this gives
$$n = H(X) \le k\log_2(n + 1),$$

which is enough to give everything except for the factor of 2.
However, note that the binomial distribution is not uniform: it's highly concentrated, and thus we should have much less entropy than a uniform distribution! Heuristically, we know that the binomial distribution is concentrated in an interval of length about $\sqrt{|S_i|}$, so the entropy should be essentially $\log_2 \sqrt{|S_i|}$. This turns out to be true if we work out the calculations, and that gives us

$$H\left(\text{Bin}\left(|S_i|, \tfrac{1}{2}\right)\right) \le \left(\tfrac{1}{2} + o(1)\right)\log_2 n,$$

and now rearranging gives the result that we want.

As a sidenote, the actual entropy of the binomial distribution $\text{Bin}\left(m, \frac{1}{2}\right)$ is $\frac{1}{2}\log_2 m + O(1)$.
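The gap between the crude estimate and the true entropy of the binomial distribution is easy to see numerically. This sketch (with a function name of our choosing) computes the exact entropy of Bin(m, 1/2) and prints it next to $\frac{1}{2}\log_2 m$ and the uniform bound $\log_2(m + 1)$.

```python
import math


def binomial_entropy(m):
    """Exact entropy in bits of a Bin(m, 1/2) random variable."""
    probs = [math.comb(m, j) / 2 ** m for j in range(m + 1)]
    return -sum(p * math.log2(p) for p in probs if p > 0)


if __name__ == "__main__":
    for m in (4, 16, 64, 256, 1024):
        exact = binomial_entropy(m)
        print(m, round(exact, 3), 0.5 * math.log2(m), round(math.log2(m + 1), 3))
```

The difference between the exact entropy and $\frac{1}{2}\log_2 m$ stays bounded as m grows, while the uniform bound is larger by roughly a factor of 2, which is where the factor of 2 in the theorem comes from.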


10.3 Bregman's theorem

Definition 10.11

The permanent of an $n \times n$ matrix $A = (a_{ij})$ is

$$\operatorname{per} A = \sum_{\sigma \in S_n} \prod_{i=1}^n a_{i,\sigma(i)}.$$

In contrast, the determinant is similar but includes a sign:

$$\det A = \sum_{\sigma \in S_n} \operatorname{sgn}(\sigma) \prod_{i=1}^n a_{i,\sigma(i)}.$$

These are very different quantities: in particular, the determinant is believed to be much easier to calculate.
Let's only consider matrices $A \in \{0,1\}^{n \times n}$: any such matrix can be encoded by a bipartite graph with n row nodes and n column nodes, where row i and column j are connected if and only if there's a 1 in the corresponding entry.


Lemma 10.12 
The permanent of a matrix $A \in \{0,1\}^{n \times n}$ is equal to the number of perfect matchings in the corresponding bipartite graph.


Proof. The permanent expands over all permutations, and we count a permutation if and only if every edge we try to 


use exists (giving us a product of 1). 


So here's a natural question to ask: if we have some degree distribution (for example, d-regular), what is the maximum number of perfect matchings that are possible? One possible extremal graph is a union of complete bipartite graphs: in the d-regular case, the number of perfect matchings is just $(d!)^{n/d}$. Is this the best we can do in general?

Theorem 10.13 (Bregman)

Given a matrix $A \in \{0,1\}^{n \times n}$ whose i-th row sums to $d_i$ for all i,

$$\operatorname{per} A \le \prod_{i=1}^n (d_i!)^{1/d_i}.$$

Note that a disjoint union of complete bipartite graphs $K_{d_i,d_i}$ gives the equality case.
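For small matrices, both the permanent and Bregman's bound can be computed directly. The sketch below uses the brute-force definition of the permanent (exponential time, so only suitable for tiny n); the function names and example matrices are ours, and `math.prod` requires Python 3.8 or later.

```python
import math
from itertools import permutations


def permanent(A):
    """Permanent of a square matrix by summing over all permutations."""
    n = len(A)
    return sum(
        math.prod(A[i][sigma[i]] for i in range(n))
        for sigma in permutations(range(n))
    )


def bregman_bound(A):
    """Product over rows of (d_i!)^(1/d_i), where d_i is the i-th row sum of a 0/1 matrix."""
    bound = 1.0
    for row in A:
        d = sum(row)
        if d == 0:
            return 0.0  # an all-zero row forces the permanent to be 0
        bound *= math.factorial(d) ** (1.0 / d)
    return bound


if __name__ == "__main__":
    # Two disjoint copies of K_{2,2}: the equality case.
    blocks = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
    print(permanent(blocks), bregman_bound(blocks))
    # A 6-cycle viewed as a bipartite graph on 3 + 3 vertices.
    cycle = [[1, 1, 0], [0, 1, 1], [1, 0, 1]]
    print(permanent(cycle), bregman_bound(cycle))
```

The block matrix has permanent 4, matching the bound up to floating-point error, while the 6-cycle has 2 perfect matchings against a bound of about 2.83.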


Proof by Radhakrishnan. Let $\sigma$ be a uniform random permutation of [n], conditioned on all $a_{i,\sigma_i}$ being 1 in our matrix. In other words, we are picking a uniform random perfect matching! By Lemma 10.12 (and the equality case of Lemma 10.3), $H(\sigma) = \log_2(\operatorname{per} A)$.

Attempt 1 (Subadditivity): $\sigma$ has n different coordinates, one for the entry $\sigma_i$ picked in each row, so each coordinate is a random variable. As we've done in the previous examples, we can try to apply subadditivity here, bounding the entropy contributed by the i-th coordinate by $H(\sigma_i)$. If there are $d_i$ 1s in that row, we can say that $H(\sigma_i) \le \log_2 d_i$ (we may not have equality because we don't have a uniform distribution on the $\sigma_i$s). Unfortunately, this is not enough! Because the $\sigma_i$ are not chosen independently (e.g. picking something in the first row affects the others), applying subadditivity directly costs us a lot.

Attempt 2 (Randomization + chain rule): Instead, let's reveal the rows in a uniform random order and then apply the chain rule. If $\tau$ is a uniform permutation in $S_n$, we now have

$$H(\sigma) = H(\sigma_{\tau_1}) + H(\sigma_{\tau_2} | \sigma_{\tau_1}) + \cdots + H(\sigma_{\tau_n} | \sigma_{\tau_1}, \ldots, \sigma_{\tau_{n-1}}).$$

Take expectations over $\tau$ on both sides. We know that the left hand side is independent of the ordering: at the end of the process, we still see all the rows, so the information we get is the same. Thus,

$$H(\sigma) = \mathbb{E}[H(\sigma_{\tau_1})] + \mathbb{E}[H(\sigma_{\tau_2} | \sigma_{\tau_1})] + \cdots + \mathbb{E}[H(\sigma_{\tau_n} | \sigma_{\tau_1}, \ldots, \sigma_{\tau_{n-1}})].$$

What's the contribution of the i-th row of our original matrix to this sum? If the row appears in the k-th term of the sum, then it contributes $\mathbb{E}[H(\sigma_i | \cdots)]$, where $\cdots$ represents a uniform subset of $k - 1$ other rows. Then,

$$\mathbb{E}[H(\sigma_i | \cdots)] \le \mathbb{E}[\log_2(\text{number of available entries in row } i \mid \cdots)].$$

Since we only care about the ordering of the $d_i$ rows whose entries conflict with the $d_i$ 1s in row i, and each ordering is equally likely, the number of available entries is uniform on $\{1, \ldots, d_i\}$, so

$$= \frac{1}{d_i}(\log_2 1 + \log_2 2 + \cdots + \log_2 d_i) = \frac{1}{d_i}\log_2(d_i!).$$

Plugging this back into the sum,

$$H(\sigma) \le \sum_{i=1}^n \frac{1}{d_i}\log_2(d_i!),$$

and exponentiating both sides yields the result.


10.4 A useful entropy lemma 


Lemma 10.14 (Shearer's lemma (special)) 


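For any random variables X, Y, Z,

$$2H(X, Y, Z) \le H(X, Y) + H(X, Z) + H(Y, Z).$$

Proof. By the chain rule,
$$H(X, Y, Z) = H(X) + H(Y|X) + H(Z|X, Y).$$

Now, add up the following:
$$H(X, Y) = H(X) + H(Y|X)$$
$$H(X, Z) = H(X) + H(Z|X)$$
$$H(Y, Z) = H(Y) + H(Z|Y).$$

Dropping conditioning (Lemma 10.7) gives $H(Y) \ge H(Y|X)$, $H(Z|X) \ge H(Z|X, Y)$, and $H(Z|Y) \ge H(Z|X, Y)$, so the sum of the right hand sides is at least $2H(X) + 2H(Y|X) + 2H(Z|X, Y) = 2H(X, Y, Z)$, which yields the result.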


What are some applications of this? 


Corollary 10.15 
Given a finite set $S \subseteq \mathbb{R}^3$, consider the orthogonal projection $\pi_{xy}(S)$ onto the xy-plane (and similarly for the xz- and yz-planes). We have

$$|S|^2 \le |\pi_{xy}(S)| \, |\pi_{xz}(S)| \, |\pi_{yz}(S)|.$$

Equality holds for a Cartesian box.

Proof. Let (X, Y, Z) be a uniform point in S. Then $\log_2 |S|$ is the entropy H(X, Y, Z), and by Shearer, this entropy satisfies

$$2\log_2 |S| \le H(X, Y) + H(X, Z) + H(Y, Z).$$

The shadow distribution doesn't need to be uniform, but we can upper bound its entropy with that of the uniform distribution:

$$\le \log_2 |\pi_{xy}(S)| + \log_2 |\pi_{xz}(S)| + \log_2 |\pi_{yz}(S)|.$$

Taking 2 to the power of both sides gives the result.
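The projection inequality is easy to test on explicit point sets. Here is a minimal sketch with helper names of our choosing; points are integer triples, and the random sets are drawn with a fixed seed purely for reproducibility.

```python
import random


def projection_sizes(S):
    """Sizes of the projections of a finite set of (x, y, z) triples onto the three coordinate planes."""
    xy = {(x, y) for x, y, z in S}
    xz = {(x, z) for x, y, z in S}
    yz = {(y, z) for x, y, z in S}
    return len(xy), len(xz), len(yz)


def satisfies_projection_bound(S):
    """Check |S|^2 <= |pi_xy(S)| * |pi_xz(S)| * |pi_yz(S)|."""
    S = set(S)
    a, b, c = projection_sizes(S)
    return len(S) ** 2 <= a * b * c


if __name__ == "__main__":
    box = {(x, y, z) for x in range(2) for y in range(3) for z in range(4)}
    print(len(box) ** 2, projection_sizes(box))  # equality for a Cartesian box
    rng = random.Random(1)
    cloud = {(rng.randrange(4), rng.randrange(4), rng.randrange(4)) for _ in range(20)}
    print(satisfies_projection_bound(cloud))
```

For the 2 x 3 x 4 box, the squared size is 576 and the projections have sizes 6, 8, and 12, whose product is also 576.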


Remark. We can actually get the same result for a volume $S \subseteq \mathbb{R}^3$: the volume of S squared is at most the product of the areas of the projections onto the three coordinate planes. This can be proved by approximating S as a union of grid boxes!


Let’s now look at Shearer’s inequality in its general form. 


Theorem 10.16 (Shearer's lemma, general) 


Let $A_1, \ldots, A_s \subseteq [n]$, where each $i \in [n]$ appears in at least k different $A_j$s. Let $X_1, \ldots, X_n$ be random variables and define the joint random variables
$$X_{A_j} = (X_i)_{i \in A_j}.$$
Then

$$kH(X_1, \ldots, X_n) \le H(X_{A_1}) + \cdots + H(X_{A_s}).$$

In the special case (Lemma 10.14), $A_1$, $A_2$, and $A_3$ are just the two-element subsets of $\{1, 2, 3\}$, and $k = 2$. The proof of the general case is the same! Let's establish a corollary analogous to Corollary 10.15:


Corollary 10.17 
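Let $A_1, \ldots, A_s \subseteq \Omega$, where each $i \in \Omega$ appears in at least k different $A_j$s. Then for every family $\mathcal{F}$ of subsets of $\Omega$,

$$|\mathcal{F}|^k \le \prod_{j=1}^s |\mathcal{F}|_{A_j}|,$$

where the notation $\mathcal{F}|_{A_j}$ means $\mathcal{F}$ restricted to the elements of $A_j$: $\{S \cap A_j : S \in \mathcal{F}\}$.

Proof. Identify $\Omega$ with [n], and let $(X_1, \ldots, X_n) \in \{0,1\}^n$ be the indicator vector of a uniform random $F \in \mathcal{F}$. Then

$$k\log_2 |\mathcal{F}| = kH(X_1, \ldots, X_n) \le \sum_{j=1}^s H(X_{A_j}).$$

Again, we can upper bound by the uniform entropy:

$$k\log_2 |\mathcal{F}| \le \sum_{j=1}^s \log_2 |\mathcal{F}|_{A_j}|,$$

and exponentiate both sides to get the desired result.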


Let's use this for a combinatorial application: in particular, what was the problem that inspired this inequality?

Problem 10.18 (Easy)

What is the largest intersecting family of subsets of an m-element set [m], where “intersecting family” means every pair has a nonempty intersection?

The answer is $2^{m-1}$: we can just pick every subset that contains the element 1. This is maximal, because any set A and its complement $[m] \setminus A$ can't both appear. If we look back to the beginning of class, the original problem restricted us to only k-element subsets; without this restriction, the problem is easy.
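For very small m, the answer can be confirmed by exhaustive search. The sketch below encodes subsets of [m] as bitmasks and tries every family of nonempty subsets; this is a brute-force illustration with names of our choosing, and it is only feasible for m up to about 4.

```python
from itertools import combinations


def is_intersecting(family):
    """True if every pair of sets (given as bitmasks) shares at least one element."""
    return all(a & b for a, b in combinations(family, 2))


def max_intersecting_family(m):
    """Largest size of an intersecting family of nonempty subsets of an m-element set."""
    subsets = list(range(1, 1 << m))  # nonempty subsets as bitmasks
    best = 0
    for mask in range(1 << len(subsets)):
        family = [s for index, s in enumerate(subsets) if (mask >> index) & 1]
        if len(family) > best and is_intersecting(family):
            best = len(family)
    return best


if __name__ == "__main__":
    for m in (1, 2, 3, 4):
        print(m, max_intersecting_family(m), 2 ** (m - 1))
```

The empty set is excluded from the search because it cannot belong to any intersecting family with more than one member.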


Problem 10.19 


What is the largest set of graphs on n labeled vertices so that every pair has a common triangle? 


We can get $\frac{1}{8}$ of the total: fix a triangle, and pick all graphs containing that fixed triangle. We also know that it's at most $\frac{1}{2}$ of the total, because we can't pick both a graph and its complement.

Theorem 10.20 (Chung-Frankl-Graham-Shearer)

Every triangle-intersecting family of graphs on n labeled vertices has at most $2^{\binom{n}{2} - 2}$ elements.


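Proof. Let $\mathcal{G}$ be a triangle-intersecting family on n vertices, where we identify each graph with its edge set. Notice that if we split the vertex set in half and look at the shadow of our family on the two cliques, we must still have an intersecting family of edge sets, because what's left over is a complete bipartite graph, which contains no triangles.

More concretely, let $m = \binom{n}{2}$. Pick a subset $S \subseteq [n]$ with $|S| = \lfloor n/2 \rfloor$, and let $A_S$ be the union of the cliques on S and $[n] \setminus S$. $K_n \setminus A_S$ is triangle-free, so $\mathcal{G}|_{A_S}$ must be intersecting. This means that $|\mathcal{G}|_{A_S}| \le 2^{|A_S| - 1}$ by the logic above. Now if we look at all possible S, each edge of $K_n$ appears in k different $A_S$s, where $k = \frac{r}{m}\binom{n}{\lfloor n/2 \rfloor}$ and $r = |A_S|$.

Now by Shearer's lemma (Corollary 10.17),
$$|\mathcal{G}|^k \le \prod_S |\mathcal{G}|_{A_S}| \le \left(2^{r-1}\right)^{\binom{n}{\lfloor n/2 \rfloor}}.$$

This simplifies to $|\mathcal{G}| \le 2^{m - m/r}$, where $m/r$ is the inverse of the edge density of $A_S$, and since $\frac{m}{r} \ge 2$ for all n, this yields the desired result.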


What's the truth, though? 


Theorem 10.21 (Ellis-Filmus-Friedgut, 2012) 


Every triangle-intersecting family of graphs on n labeled vertices has at most $2^{\binom{n}{2} - 3}$ elements.


The proof of this more refined result uses Fourier analysis! 


10.5 Entropy in graph theory 


Problem 10.22 


Among all d-regular graphs G, how can we maximize the quantity

$$i(G)^{1/v(G)},$$

where i(G) is the number of independent sets and v(G) is the number of vertices of G?

It turns out that this quantity is maximized for a disjoint union of copies of $K_{d,d}$. Let's start by doing this in a special case:

Theorem 10.23 (Kahn)

For a bipartite n-vertex d-regular graph G,

$$i(G) \le i(K_{d,d})^{n/(2d)}.$$

Equality holds if and only if G is the disjoint union of copies of $K_{d,d}$.
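Before the proof, it may help to see the bound on a few small graphs. The sketch below counts independent sets by brute force over all vertex subsets (so it is only meant for tiny graphs); the graph constructors and function names are our own choices for illustration.

```python
from itertools import product


def count_independent_sets(n, edges):
    """Number of independent sets (including the empty set) of a graph on vertices 0..n-1."""
    count = 0
    for x in product((0, 1), repeat=n):
        if all(not (x[u] and x[v]) for u, v in edges):
            count += 1
    return count


def complete_bipartite(d):
    """K_{d,d} on vertices 0..2d-1, returned as (vertex count, edge list)."""
    return 2 * d, [(a, d + b) for a in range(d) for b in range(d)]


def cycle(n):
    """The cycle C_n, which is 2-regular and bipartite when n is even."""
    return n, [(i, (i + 1) % n) for i in range(n)]


def cube():
    """The 3-dimensional hypercube graph, which is 3-regular and bipartite."""
    return 8, [(u, u ^ (1 << b)) for u in range(8) for b in range(3) if u < u ^ (1 << b)]


def kahn_bound(n, d):
    """The upper bound i(K_{d,d})^(n / (2d))."""
    return count_independent_sets(*complete_bipartite(d)) ** (n / (2 * d))


if __name__ == "__main__":
    print(count_independent_sets(*cycle(6)), kahn_bound(6, 2))
    print(count_independent_sets(*cube()), kahn_bound(8, 3))
```

The 6-cycle has 18 independent sets against a bound of about 18.5, and the cube has 35 against a bound of about 37, so the inequality is fairly tight even away from the equality case.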


Proof. Pick a bipartition of V(G) = AUB, and let X = (Xv)vewe) be the indicator vector for an independent set of 
G chosen uniformly at random. (In other words, pick a random independent set, and put a 1 for each vertex in the 
set and 0 everywhere else.) Then the entropy of this variable is just H(X) = logs(i(G)). 


How can we upper bound this? X is not necessarily uniform or independent on the vertices, but we can still write 
logo(1(G)) = H(X) = H(Xa) + H(Xe|Xa). 


Observe that because the graph is d-regular and bipartite, each vertex in A lies in the neighbor sets of d vertices in 


B. Therefore, we can simplify the first term using Theorem 10.16 and also bound the second term by subadditivity: 


H(X) <=> H(Xneay) +o A%IXa) 
beB beEB 


Dropping conditioning on the second term (forgetting about the non-neighbors), 


1 
H(X) < 5 SS H(Xney) + 9 A(XolXnve))- 
beB bEB 


Fix a b © B. We upper bound the expression 
H(Xwoy) + dH(Xo|X cp): 


We want to relate this to the entropy of a random independent set of $K_{d,d}$: we will do so by replacing $X_b$ with $d$ copies $X_b^{(1)}, \dots, X_b^{(d)}$ that are conditionally independent given $X_{N(b)}$ and that each have the same distribution given $X_{N(b)}$ as the original $X_b$. Then,

\[
\begin{aligned}
H(X_{N(b)}) + d\,H(X_b \mid X_{N(b)}) &= H(X_{N(b)}) + H(X_b^{(1)} \mid X_{N(b)}) + \cdots + H(X_b^{(d)} \mid X_{N(b)}) \\
&= H(X_{N(b)}) + H(X_b^{(1)}, \dots, X_b^{(d)} \mid X_{N(b)}) \\
&= H(X_{N(b)}, X_b^{(1)}, \dots, X_b^{(d)}),
\end{aligned}
\]

where the second equality uses conditional independence and the last equality follows from the chain rule. The key observation is that the joint random variable $Y = (X_{N(b)}, X_b^{(1)}, \dots, X_b^{(d)})$ is the indicator vector of some random independent set of $K_{d,d}$: $X_{N(b)}$ corresponds to the $d$ vertices on the left side, and the $d$ variables $X_b^{(i)}$ correspond to the $d$ different vertices on the right side! The values that $Y$ takes correspond to independent sets, because the original $X_b$ (and thus each of the copies) is never 1 if there's a 1 in any coordinate of $X_{N(b)}$.

This distribution of $Y$ may not be uniform, but we can still upper bound its entropy by the entropy of the uniform distribution over independent sets of $K_{d,d}$, which is (by Lemma 10.3 as always) $\log_2 i(K_{d,d})$.

Our graph $G$ is $d$-regular and bipartite, so the two pieces of the bipartition each have size $\frac{n}{2}$. Because the above bound holds for every $b \in B$,
\[ \log_2 i(G) \le \frac{1}{d} \cdot \frac{n}{2} \log_2 i(K_{d,d}) = \frac{n}{2d} \log_2 i(K_{d,d}), \]
as desired.


In this proof, we used almost nothing about independent sets, and that motivates us to generalize this result. 


Definition 10.24 
A graph homomorphism $\phi : G \to H$ is a map of vertex sets $\phi : V(G) \to V(H)$ such that every edge $uv$ of $G$ is mapped to an edge $\phi(u)\phi(v)$ of $H$.


Example 10.25 


Here are two examples of graph homomorphisms: 


- Independent sets: Let $H$ be the graph on two vertices $\{0, 1\}$ with an edge between 0 and 1 and a self-loop on 0. Then a map $\phi : V(G) \to V(H)$ is a homomorphism if and only if $\phi^{-1}(1)$ forms an independent set.
- $q$-colorings: Let $H = K_q$. The proper $q$-colorings of a graph $G$ correspond to homomorphisms from $G$ to $H$: color each vertex of $G$ that maps to $i$ with the color $i$.


Theorem 10.26 (Galvin-Tetali)
Let $G$ be an $n$-vertex, $d$-regular bipartite graph, and let $H$ be any (possibly looped) graph. Let $\operatorname{Hom}(G, H)$ be the set of homomorphisms from $G$ to $H$: then

\[ |\operatorname{Hom}(G, H)| \le |\operatorname{Hom}(K_{d,d}, H)|^{n/(2d)}. \]


The proof of this result is identical to the proof of Theorem 10.23. 


Corollary 10.27 
Let $G$ be an $n$-vertex $d$-regular bipartite graph, and let $q \in \mathbb{N}$. Let $c_q(G)$ denote the number of proper $q$-colorings of $G$: then

\[ c_q(G) \le c_q(K_{d,d})^{n/(2d)}. \]


Proof. Let X be the vector of colors of a uniformly random coloring of G, and the rest follows as above. 


Is it possible to prove an analog of Theorem 10.23 for general (not necessarily bipartite) graphs? The answer is 


yes! 


Theorem 10.28 
For an $n$-vertex $d$-regular graph $G$,
\[ i(G) \le i(K_{d,d})^{n/(2d)}. \]

Equality holds if (and only if) $G$ is a disjoint union of copies of $K_{d,d}$.


Proof. We will reduce to the bipartite case. 


Lemma 10.29 (Zhao) 
For all $G$,

\[ i(G)^2 \le i(G \times K_2). \]
Remark. There are lots of ways to denote a graph product. Given two paths $G$ and $H$ on 4 vertices, there are three main ways to construct a graph product of those paths: the Cartesian product, the tensor product, and the strong product.

These, naturally, should be denoted $G \square H$, $G \times H$, and $G \boxtimes H$, respectively.


Proof of lemma. We will construct an injection from $\mathcal{I}(G \sqcup G)$, the collection of independent sets in two disjoint copies of $G$, to $\mathcal{I}(G \times K_2)$. Think of $G \sqcup G$ as two copies of $G$, one above the other, and $G \times K_2$ as the same thing but with parallel edges replaced with crosses.

Let's say we have some independent set $S \in \mathcal{I}(G \sqcup G)$: if we take those same vertices in $G \times K_2$, we might not have an independent set, because there are some bad edges. Treating $G \sqcup G$ as two layers 0 and 1, let
\[ E_{\text{bad}} = \{uv \in E(G) : (u,0), (v,1) \in S\}. \]

Every edge in $E_{\text{bad}}$ corresponds to an edge of $G \times K_2$ with one endpoint in $\{u \in V(G) : (u,0) \in S\}$, which is the set of vertices of $S$ (our not-quite independent set) in layer 0, and the other endpoint in the corresponding set for layer 1. Since each layer of $S$ is independent in $G$, no vertex of a bad edge lies in both sets, so the bad edges form a bipartite graph. Fix some ordering of the subsets of $V(G)$ (for example, lexicographic), and take $Q$ to be the first subset (in our ordering) of $V(G)$ such that each bad edge in $E_{\text{bad}}$ has exactly one endpoint in $Q$. In other words, we're finding some canonical subset that “shows” our bipartition.

Now swap each pair of vertices of $G \times K_2$ lying over $Q$ (in other words, for each $v \in Q$, replace $(v,0)$ with $(v,1)$ and vice versa): we can check that this gives us an independent set in $G \times K_2$. In addition, this mapping is injective: from the image we can find the edges of $E(G)$ that correspond to $E_{\text{bad}}$, and then we can find $Q$ and reverse all of the swaps that we did.


The graph $G \times K_2$ is $d$-regular and bipartite with $2n$ vertices, so we can apply Theorem 10.23. This gives the inequality
\[ i(G) \le i(G \times K_2)^{1/2} \le i(K_{d,d})^{n/(2d)}, \]

and we're done.
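To see the lemma $i(G)^2 \le i(G \times K_2)$ in action on small non-bipartite graphs, here is a brute-force sketch. The encoding of the vertex $(v, \ell)$ of $G \times K_2$ as the integer $v + \ell n$ and the function names are choices made for this example only.

```python
def count_independent_sets(n, edges):
    """Count independent sets (including the empty set) by brute force over bitmasks."""
    count = 0
    for mask in range(1 << n):
        if all(not ((mask >> u) & 1 and (mask >> v) & 1) for u, v in edges):
            count += 1
    return count


def tensor_with_k2(n, edges):
    """Edges of G x K_2, where vertex (v, layer) is encoded as v + layer * n."""
    crossed = []
    for u, v in edges:
        crossed.append((u, v + n))  # (u, 0) -- (v, 1)
        crossed.append((v, u + n))  # (v, 0) -- (u, 1)
    return crossed


def cycle_edges(n):
    """Edges of the cycle C_n on vertices 0..n-1."""
    return [(i, (i + 1) % n) for i in range(n)]


for n in (3, 5, 7):  # odd cycles are not bipartite
    edges = cycle_edges(n)
    i_g = count_independent_sets(n, edges)
    i_cover = count_independent_sets(2 * n, tensor_with_k2(n, edges))
    print(n, i_g ** 2, i_cover, i_g ** 2 <= i_cover)
```

For an odd cycle $C_n$, the product $C_n \times K_2$ is the cycle $C_{2n}$, so this compares $i(C_n)^2$ with $i(C_{2n})$.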


This means that for independent sets, we can drop the bipartite hypothesis: can we do the same in general for graph homomorphisms? The answer is no!


Example 10.30 
Take $H$ to be two disjoint loops. Any graph homomorphism into $H$ sends each connected component to one of the two vertices of $H$, so the graph with the most graph homomorphisms into $H$ is not a union of copies of $K_{d,d}$ but rather a union of cliques $K_{d+1}$, since we're just trying to maximize the number of connected components.


The above bipartite swapping trick does not work for some variants of the problem, such as the number of q- 
colorings instead of the number of independent sets. Recently, the problem for the number of proper colorings was 
settled using a different method by Sah, Sawhney, Stoner, and Zhao. 

Also, we can weaken “bipartite” to “triangle-free” in the graph homomorphism theorem. On the flip side, for any $G$ with triangles, there exists a graph $H$ for which the theorem is not true! However, we don't have good conjectures on classifications of the graph $H$.
10.6 More on graph homomorphisms: Sidorenko’s conjecture


Definition 10.31 


Let $t(H, G)$ denote the number of homomorphisms from $H$ to $G$, divided by the total number of vertex maps $|V(G)|^{|V(H)|}$.

In other words, this is the probability that a uniform random map $V(H) \to V(G)$ is a graph homomorphism.


Conjecture 10.32 (Sidorenko) 
If $H$ is a bipartite graph, then for all $G$, the homomorphism density satisfies

\[ t(H, G) \ge t(K_2, G)^{e(H)}, \]

where $t(K_2, G)$ is the edge density and $e(H)$ is the number of edges of $H$.


This can be phrased another way: among all graphs $G$ with a fixed edge density, which $G$ has the minimum number of copies of $H$? Sidorenko's conjecture says (informally) that this is a “random” $G$. This is still an open problem, but let's look at a specific case.


Theorem 10.33 
Let $G$ be a graph with $n$ vertices and $m$ edges, and let $P_4$ be a three-edge path. Then

\[ \operatorname{hom}(P_4, G) \ge n^4 \left( \frac{2m}{n^2} \right)^3 = \frac{(2m)^3}{n^2}. \]


Proof. We'll use the entropy method, but the proof will look slightly different from the techniques that have been used so far. We're trying to lower-bound our quantity this time, so we don't necessarily want to start with a uniform distribution.

Basically, our goal is to construct a probability distribution on the set of homomorphisms $\operatorname{Hom}(P_4, G)$ with entropy at least $\log_2 \frac{(2m)^3}{n^2}$. Then, since the uniform distribution maximizes entropy, the entropy of the uniform distribution, which is $\log_2$ of the number of homomorphisms, is at least that quantity. Note that a homomorphism from $P_4$ to $G$ is just a walk on 4 vertices of $G$.

Construct $X, Y, Z, W$ to be a 4-vertex walk on $G$ in the following way: let $XY$ be a uniform random (oriented) edge of the graph, $Z$ be a uniform neighbor of $Y$ (allowing $X$), and $W$ be a uniform neighbor of $Z$. The entropy of this distribution is, by the chain rule,

\[ H(X, Y, Z, W) = H(X) + H(Y \mid X) + H(Z \mid X, Y) + H(W \mid X, Y, Z). \]


Note that if $XY$ is a uniform edge, $YZ$ and $ZW$ are also uniformly distributed. This is because the distribution of $X$ on the vertices is proportional to the degree $d(v)$: specifically,

\[ \Pr[X = v] = \frac{d(v)}{2m}. \]

This is true for $Y$ as well, and now the distribution of $Z$ as a uniform neighbor of $Y$ is the same as the distribution of $X$ as a uniform neighbor of $Y$: $Z \mid Y \sim X \mid Y$. So $YZ$ is uniform, and so is $ZW$ by the same argument. Since $Z$ depends only on $Y$ and $W$ depends only on $Z$, that means

\[ H(X, Y, Z, W) = H(X) + H(Y \mid X) + H(Z \mid Y) + H(W \mid Z) = H(X) + 3H(Y \mid X), \]
and this is (by definition)

\[ \sum_v \frac{d(v)}{2m} \log_2 \frac{2m}{d(v)} + 3 \sum_v \frac{d(v)}{2m} \log_2 d(v), \]

where $\log_2 d(v)$ is $H(Y \mid X = v)$. Expanding and applying convexity of $x \mapsto x \log_2 x$, this is

\[ = \log_2(2m) + 2 \sum_v \frac{d(v)}{2m} \log_2 d(v) \ge \log_2(2m) + 2 \log_2 \frac{2m}{n}, \]

and rearranging gives an entropy of at least $\log_2 \frac{(2m)^3}{n^2}$.
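The bound can be checked numerically on small graphs. The sketch below counts 4-vertex walks (which are exactly the homomorphisms from a three-edge path) by summing $d(y)\,d(z)$ over oriented edges $(y, z)$; the function names are illustrative only.

```python
def count_p4_homomorphisms(n, edges):
    """Number of walks x, y, z, w in G: choose the middle oriented edge (y, z),
    then any neighbor x of y and any neighbor w of z."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    deg = [len(a) for a in adj]
    return sum(deg[y] * deg[z] for y in range(n) for z in adj[y])


def satisfies_bound(n, edges):
    """Check hom(P_4, G) >= (2m)^3 / n^2 in exact integer arithmetic."""
    m = len(edges)
    return count_p4_homomorphisms(n, edges) * n ** 2 >= (2 * m) ** 3


star = [(0, 1), (0, 2), (0, 3)]                # K_{1,3}: very irregular degrees
five_cycle = [(i, (i + 1) % 5) for i in range(5)]  # regular, so the bound is tight

print(count_p4_homomorphisms(4, star), satisfies_bound(4, star))
print(count_p4_homomorphisms(5, five_cycle), satisfies_bound(5, five_cycle))
```

Regular graphs give equality, which matches the convexity step in the proof: Jensen's inequality is tight exactly when all degrees are equal.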


It turns out that this proof works for every tree. For what kinds of graphs is Sidorenko’s conjecture harder to 


resolve? 


Fact 10.34 


The smallest open case of Sidorenko's conjecture is the following Möbius graph: it's $K_{5,5}$ minus a Hamiltonian cycle.

It turns out this is the incidence graph of the smallest simplicial complex of the Möbius strip. One side is the set of vertices, and the other is the set of faces.

Notably, the Möbius graph doesn't fit the conditions of the following theorem, which resolves Sidorenko's conjecture for certain graphs:

Theorem 10.35 (Conlon-Fox-Sudakov)
Sidorenko's conjecture holds for a graph $H$ if there exists a bipartition $V(H) = A \sqcup B$ such that there exists a vertex $a \in A$ with $N(a) = B$.


There are also ways to interpret Sidorenko’s conjecture beyond graph theory! It turns out Sidorenko's conjecture 


where H is a three-edge path (Theorem 10.33) is equivalent to the following inequality: 


Proposition 10.36 
Given a symmetric function $f : [0,1]^2 \to [0, \infty)$,

\[ \int_{[0,1]^4} f(x,y) f(y,z) f(z,w)\,dx\,dy\,dz\,dw \ge \left( \int_{[0,1]^2} f(x,y)\,dx\,dy \right)^3. \]


As a grad student, Professor Zhao posted on Math Overflow a few years ago asking for a Cauchy-Schwarz proof 


of this. A week ago, Sidorenko actually answered it! 


Sidorenko. Think of $g(x) = \int f(x,y)\,dy$ as representing the “degree of $x$.” Integrating out $w$, the left hand side becomes

\[ \int f(x,y) f(y,z) g(z)\,dx\,dy\,dz, \]

but we can also relabel the path as $w \to x \to y \to z$, so the left hand side is also

\[ \int g(x) f(x,y) f(y,z)\,dx\,dy\,dz. \]

Applying Cauchy-Schwarz to these two expressions,
\[ \text{LHS} \ge \int g(x)^{1/2} f(x,y) f(y,z) g(z)^{1/2}\,dx\,dy\,dz, \]

and since this integrand is symmetric with respect to $x$ and $z$, we can write this as

\[ = \int \left( \int g(x)^{1/2} f(x,y)\,dx \right)^2 dy \ge \left( \int g(x)^{1/2} f(x,y)\,dx\,dy \right)^2 \]

by Cauchy-Schwarz, and now we can integrate out $y$ to get

\[ \left( \int g(x)^{3/2}\,dx \right)^2 \ge \left( \int g(x)\,dx \right)^3 = \left( \int f(x,y)\,dx\,dy \right)^3, \]

and we're done.
11 The occupancy method


11.1 Introducing the technique 


Let’s think more about how to approach independent sets in graphs — in physics, this is related to the hard-core model. 
Imagine a physical system with solid balls that cannot overlap (for example, in a system of atomic particles). This 
corresponds to having “independent sets,” since we don't want adjacent vertices to be occupied. 

Consider picking a random independent set $I \subseteq V(G)$, so that every set $I$ is chosen with probability proportional to $\lambda^{|I|}$. Here, $\lambda$ is known as the fugacity. Normalizing this probability, the denominator is something that will come up frequently:


Definition 11.1 


Define the independence polynomial or partition function

\[ P_G(\lambda) = \sum_{I \text{ independent set}} \lambda^{|I|}. \]

Note that the probability of a set $I$ being picked is $\lambda^{|I|} / P_G(\lambda)$.

In particular, if $\lambda = 1$, then $P_G(1) = i(G)$ (the number of independent sets) and we're choosing our independent set uniformly at random. We actually have a result bounding the value of our partition function across all graphs $G$:


Theorem 11.2 (Kahn, Zhao, Galvin-Tetali) 


If $G$ is an $n$-vertex $d$-regular graph, and $\lambda > 0$ is our parameter,

\[ P_G(\lambda)^{1/n} \le P_{K_{d,d}}(\lambda)^{1/(2d)}. \]

Previously, we proved this theorem for $\lambda = 1$. A proof for general $\lambda$ was recently found (credited to Davies-Jenssen-Perkins-Roberts) that is very probabilistic! Here, we have a change of perspective: instead of trying to count everything directly, we find an “observable” by sampling at random.


Definition 11.3 
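Let the occupancy fraction be the expected fraction of $V(G)$ contained in the random independent set $I$ that we've chosen:

\[ \alpha_G(\lambda) = \frac{1}{|V(G)|} \mathbb{E}[|I|]. \]

We can explicitly write this out and then rewrite it as a derivative:

\[ \alpha_G(\lambda) = \frac{1}{n P_G(\lambda)} \sum_I |I| \lambda^{|I|} = \frac{\lambda P_G'(\lambda)}{n P_G(\lambda)} = \frac{\lambda}{n} \frac{d}{d\lambda} \log P_G(\lambda). \]

This is related to the cumulant generating function, which is the logarithm of the moment generating function.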
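The two expressions for the occupancy fraction can be compared on a small example. The sketch below computes the expectation exactly with `fractions.Fraction` and approximates the logarithmic derivative with a central difference; the step size `h` and all function names are assumptions of this illustration.

```python
import math
from fractions import Fraction


def size_counts(n, edges):
    """counts[k] = number of independent sets of size k (brute force)."""
    counts = [0] * (n + 1)
    for mask in range(1 << n):
        if all(not ((mask >> u) & 1 and (mask >> v) & 1) for u, v in edges):
            counts[bin(mask).count("1")] += 1
    return counts


def partition_function(counts, lam):
    """P_G(lam) = sum over independent sets I of lam^|I|."""
    return sum(c * lam ** k for k, c in enumerate(counts))


def occupancy_direct(n, edges, lam):
    """E|I| / n under the hard-core model, computed exactly."""
    lam = Fraction(lam)
    counts = size_counts(n, edges)
    weighted_sizes = sum(k * c * lam ** k for k, c in enumerate(counts))
    return weighted_sizes / partition_function(counts, lam) / n


def occupancy_log_derivative(n, edges, lam, h=1e-5):
    """(lam / n) * d/dlam log P_G(lam), via a central finite difference."""
    counts = size_counts(n, edges)
    log_p = lambda x: math.log(partition_function(counts, x))
    return lam / n * (log_p(lam + h) - log_p(lam - h)) / (2 * h)


four_cycle = [(0, 1), (1, 2), (2, 3), (3, 0)]  # C_4 = K_{2,2}, P(lam) = 1 + 4 lam + 2 lam^2
print(occupancy_direct(4, four_cycle, 1))
print(occupancy_log_derivative(4, four_cycle, 1.0))
```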
Theorem 11.4
Let $G$ be a $d$-regular graph, and let $\lambda > 0$ be our fugacity parameter. Then

\[ \alpha_G(\lambda) \le \alpha_{K_{d,d}}(\lambda). \]
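For $K_{d,d}$ the independent sets are exactly the subsets of one side, so $P_{K_{d,d}}(\lambda) = 2(1+\lambda)^d - 1$ and the occupancy fraction has the closed form $\lambda(1+\lambda)^{d-1} / (2(1+\lambda)^d - 1)$. The sketch below, with hypothetical helper names, compares this against brute-force values for two small 2-regular graphs.

```python
from fractions import Fraction


def occupancy(n, edges, lam):
    """Exact occupancy fraction E|I| / n of the hard-core model, by brute force."""
    lam = Fraction(lam)
    total_weight = Fraction(0)
    weighted_size = Fraction(0)
    for mask in range(1 << n):
        if all(not ((mask >> u) & 1 and (mask >> v) & 1) for u, v in edges):
            size = bin(mask).count("1")
            total_weight += lam ** size
            weighted_size += size * lam ** size
    return weighted_size / total_weight / n


def occupancy_kdd(d, lam):
    """Closed form for K_{d,d}, derived from P(lam) = 2 (1 + lam)^d - 1."""
    lam = Fraction(lam)
    return lam * (1 + lam) ** (d - 1) / (2 * (1 + lam) ** d - 1)


triangle = [(0, 1), (1, 2), (0, 2)]               # 2-regular, not bipartite
six_cycle = [(i, (i + 1) % 6) for i in range(6)]  # 2-regular, bipartite

for lam in (Fraction(1, 10), Fraction(1), Fraction(5)):
    bound = occupancy_kdd(2, lam)
    print(lam, occupancy(3, triangle, lam) <= bound, occupancy(6, six_cycle, lam) <= bound)
```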


So when we sample an independent set, the expected fraction of occupied vertices is maximized by $K_{d,d}$. In particular, note that

\[ \frac{\log P_G(\lambda)}{n} = \int_0^\lambda \frac{\alpha_G(t)}{t}\,dt, \qquad \frac{\log P_{K_{d,d}}(\lambda)}{2d} = \int_0^\lambda \frac{\alpha_{K_{d,d}}(t)}{t}\,dt, \]

and now we can compare the integrands to get the following, which is what we're trying to show in Theorem 11.2 above about the partition function:

Corollary 11.5 (implies Theorem 11.2)

\[ \frac{\log P_G(\lambda)}{n} \le \frac{\log P_{K_{d,d}}(\lambda)}{2d}. \]


Let's first prove a special case of our theorem: 


Proof of Theorem 11.4 for G triangle-free. Pick a random independent set $I$ (under our hard-core model) on $G$ with fugacity $\lambda$. Call a vertex $v$ occupied if $v \in I$ and uncovered if $N(v) \cap I = \emptyset$; that is, $v$ has no occupied neighbors (though $v$ can still be occupied).

If $v$ is occupied, it must be uncovered (because a vertex and its neighbor can't both be in an independent set), and we also have the following facts:

- If $v$ is uncovered, $v$ can be occupied or not: the probability

\[ \Pr[v \text{ occupied} \mid v \text{ uncovered}] = \frac{\lambda}{1+\lambda} \]

because $v$ acts in isolation.
- Meanwhile, what about

\[ \Pr[v \text{ uncovered} \mid v \text{ has exactly } j \text{ uncovered neighbors}]? \]

The covered neighbors can't be occupied, so we don't need to worry about them. We know that $G$ is triangle-free, so $N(v)$ is an independent set, and if we condition on everything farther away from $v$ than $N(v)$, the $j$ uncovered neighbors are independent of each other. Thus, this is just

\[ \frac{1}{(1+\lambda)^j}. \]


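So now let's find $\alpha_G(\lambda)$ in two different ways. For one, this is just

\[ \alpha_G(\lambda) = \frac{1}{n} \sum_{v \in V} \Pr(v \text{ occupied}) = \frac{1}{n} \cdot \frac{\lambda}{1+\lambda} \sum_{v \in V} \Pr(v \text{ uncovered}), \]

since all occupied vertices are uncovered. But by the second fact above, we also know that this is equal to

\[ \frac{1}{n} \cdot \frac{\lambda}{1+\lambda} \sum_{v \in V} \sum_{j=0}^{d} \Pr(v \text{ has } j \text{ uncovered neighbors}) (1+\lambda)^{-j} \]

by the law of total probability.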
Now here's another way to compute this quantity:

\[ \alpha_G(\lambda) = \frac{1}{n} \sum_{v} \frac{1}{d} \sum_{u \in N(v)} \Pr(u \text{ occupied}), \]

using the fact that $G$ is $d$-regular, so each vertex is equally likely to come up. This can then be written as

\[ = \frac{1}{n} \cdot \frac{1}{d} \cdot \frac{\lambda}{1+\lambda} \sum_{v} \sum_{u \in N(v)} \Pr(u \text{ uncovered}). \]

This can be thought of as running a two-part experiment: pick $I$ from the hard-core model, and then pick $v$ to be a uniform random vertex of $G$. Let $Y$ be the random variable equal to the number of uncovered neighbors of $v$ with respect to $I$.

On one hand, the occupancy fraction $\alpha_G(\lambda)$ is

\[ \alpha_G(\lambda) = \frac{\lambda}{1+\lambda} \mathbb{E}\left[ (1+\lambda)^{-Y} \right] \]

from the first calculation. But by the second calculation, we also have

\[ \alpha_G(\lambda) = \frac{1}{d} \cdot \frac{\lambda}{1+\lambda} \mathbb{E}[Y]. \]
Setting these equal,
\[ \mathbb{E}\left[ (1+\lambda)^{-Y} \right] = \frac{1}{d} \mathbb{E}[Y], \]
where we should remember that $Y$ is a random variable supported on $\{0, 1, \dots, d\}$, since our graph is $d$-regular.


But now instead of considering $Y$ coming from this specific process, let's imagine $Y$ is any random variable supported on $\{0, 1, \dots, d\}$ that satisfies the condition we've just derived. Our goal is to show that

\[ \frac{1}{d} \cdot \frac{\lambda}{1+\lambda} \mathbb{E}[Y] \le \alpha_{K_{d,d}}(\lambda). \]

Note that this can be phrased as a linear program: let $x_k = \Pr(Y = k)$, and then we want to maximize

\[ \mathbb{E}[Y] = \sum_{k=0}^{d} k x_k \]

under the linear constraints
\[ x_k \ge 0, \qquad \sum_{k=0}^{d} x_k = 1, \qquad \sum_{k=0}^{d} x_k \left( (1+\lambda)^{-k} - \frac{k}{d} \right) = 0. \]
It turns out that the maximum for this linear program is the value that arises from $G = K_{d,d}$. This is because $y \mapsto (1+\lambda)^{-y}$ is convex: for a given expectation $\mathbb{E}[Y]$, the quantity $\mathbb{E}[(1+\lambda)^{-Y}]$ is maximized when we concentrate everything on $\{0, d\}$. In addition, for every $\lambda$, there is a unique random variable $Y$ supported on $\{0, d\}$ satisfying the constraints we want. The one that results from $K_{d,d}$ satisfies all such constraints: thus, it must be the maximizer.


But remember that we've been doing this for G triangle-free. How are things different when we look at G in 


general? 


Proof in general. Again, we do a two-part experiment: pick $I$ to be a random independent set from the hard-core model, and choose $v$ to be a uniform random vertex of $G$. Let $H$ be the graph induced by the uncovered neighbors of $v$ (beforehand, we only cared about the number of such neighbors for calculations).

So now we repeat the calculation: the first calculation becomes

\[ \alpha_G(\lambda) = \frac{\lambda}{1+\lambda} \mathbb{E}\left[ \frac{1}{P_H(\lambda)} \right], \]

and the second becomes

\[ \alpha_G(\lambda) = \frac{1}{d} \mathbb{E}[\text{number of occupied neighbors of } v] = \frac{\lambda}{d} \mathbb{E}\left[ \frac{P_H'(\lambda)}{P_H(\lambda)} \right]. \]

So now instead of optimizing over all possible distributions of $Y$, we can set up a similar linear program over distributions of $H$: our main constraint is that the two ways of finding $\alpha_G(\lambda)$ are equal. Then we just need to show that our guess for the optimal solution is correct, and this can be done by showing that all of the dual constraints are satisfied.


11.2 An alternative approach to the above problem 
Here's another way to set up a linear program! Again, let's consider $\alpha_G(\lambda)$ in two different ways (we'll drop $G$ and $\lambda$ throughout for the sake of notation). Then

\[ \alpha = \Pr(v \text{ occupied}) = \frac{\lambda}{1+\lambda} \Pr(v \text{ uncovered}). \]

Letting $X$ be the number of occupied neighbors of $v$, we have a random variable that depends both on our independent set $I$ and our vertex $v$. Note that saying $v$ is uncovered is the same as saying that $X = 0$.

On the other hand, if we pick $v$ and then pick a uniform neighbor $u$, then $u$ is also a uniform random vertex (because $G$ is $d$-regular). Now

\[ \alpha = \frac{1}{d} \sum_{u \in N(v)} \Pr(u \text{ occupied}) = \frac{1}{d} \mathbb{E}[X] \]
by linearity of expectation. We know that $X$ takes on one of the values $\{0, 1, \dots, d\}$: let $p_k$ denote the probability $\Pr(X = k)$. Since we counted $\alpha$ in two ways, we can set the values equal:
\[ \frac{\lambda}{1+\lambda} \Pr(X = 0) = \frac{1}{d} \mathbb{E}[X], \]
and plugging in values yields
\[ \frac{\lambda}{1+\lambda} p_0 = \frac{1}{d} \left( p_1 + 2p_2 + \cdots + d p_d \right). \]


$X$ is some random variable satisfying various constraints, such as this one, and we can relax the problem to a linear program over all distributions satisfying the constraints we collect. But this one constraint isn't capturing the whole system, since it doesn't give a strong enough answer.

So what else can we do? We can consider the probabilities $p_k$ and $p_{k-1}$ and try to come up with relations between them. If exactly $k$ neighbors of $v$ are occupied, we can produce another independent set $I'$ with $k - 1$ occupied neighbors by removing one of them from $I$. There are $k$ ways to remove a neighbor, and there are at most $d - k + 1$ ways to go back to $I$: putting this together with the factor of $\lambda$ from larger independent sets, we have

\[ \frac{\lambda p_{k-1}}{k} \ge \frac{p_k}{d-k+1} \]

for all $2 \le k \le d$. Throw these into our linear program as well: we now want to maximize $\frac{\lambda}{1+\lambda} p_0$ given $p_0, \dots, p_d \ge 0$, $p_0 + \cdots + p_d = 1$, plus the constraints we've found above. This will give us some upper bound on $\alpha_G(\lambda)$ for a $d$-regular graph $G$, though a priori it may not be optimal, since we've only considered some of the constraints.


But it turns out this is indeed enough: 


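Lemma 11.6
If $(p_0, \dots, p_d)$ is a maximizer of our linear program, then every inequality of the form $\frac{\lambda p_{k-1}}{k} \ge \frac{p_k}{d-k+1}$ is an equality.

Proof. Otherwise, there is some $k$ with a strict inequality

\[ \frac{\lambda p_{k-1}}{k} > \frac{p_k}{d-k+1}, \]

and we can perturb our $p$s a bit: increase $p_0$ by $\varepsilon$, decrease $p_{k-1}$ by $\left( \frac{d\lambda}{1+\lambda} + k \right) \varepsilon$, and increase $p_k$ by $\left( \frac{d\lambda}{1+\lambda} + (k-1) \right) \varepsilon$. For small enough $\varepsilon > 0$ all of the constraints still hold, and $p_0$ has increased, so we have a better feasible point than our maximizer (contradiction).

So now we have a full-rank system of linear equalities, so there exists a unique solution. Indeed $G = K_{d,d}$ satisfies all of the equalities, and we're done.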
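The claim that $K_{d,d}$ satisfies all of the equalities can be checked directly. In $K_{d,d}$, with $X$ the number of occupied neighbors of a fixed vertex $v$, an independent set either lies on $v$'s side (so $X = 0$) or consists of $k \ge 1$ vertices on the opposite side (so $X = k$). The sketch below writes the local constraints in the cleared-denominator form $k\,p_k = \lambda (d - k + 1)\,p_{k-1}$ for $2 \le k \le d$, which is this example's reading of the relations above; function names are illustrative.

```python
from fractions import Fraction
from math import comb


def kdd_distribution(d, lam):
    """p[k] = Pr(X = k) for the hard-core model on K_{d,d} at fugacity lam."""
    lam = Fraction(lam)
    z = 2 * (1 + lam) ** d - 1            # partition function of K_{d,d}
    p = [Fraction(0)] * (d + 1)
    p[0] = (1 + lam) ** d / z             # any subset of v's own side
    for k in range(1, d + 1):
        p[k] = comb(d, k) * lam ** k / z  # k occupied vertices on the other side
    return p


def satisfies_equalities(d, lam):
    """Check normalization, the two-way count of alpha, and the local equalities."""
    lam = Fraction(lam)
    p = kdd_distribution(d, lam)
    normalized = sum(p) == 1
    two_ways = lam / (1 + lam) * p[0] == sum(k * p[k] for k in range(d + 1)) / d
    local = all(k * p[k] == lam * (d - k + 1) * p[k - 1] for k in range(2, d + 1))
    return normalized and two_ways and local


print(all(satisfies_equalities(d, lam)
          for d in (2, 3, 5)
          for lam in (Fraction(1, 3), 1, 4)))
```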

The lesson here is that picking a uniform random v and considering it locally gives linear constraints. By only 
looking at those constraints, we can get some bound on the occupancy fraction, and this is usually enough to solve 


the problem. 


11.3 Further bounds with the occupancy method

Let's try to lower bound the occupancy fraction instead of upper bounding it. This was initially developed by Shearer: “triangle-free graphs have large independent sets.”

Theorem 11.7
Fix a parameter $\lambda > 0$, and let $G$ be a triangle-free graph with maximum degree $d$. Then

\[ \alpha_G(\lambda) \ge (1 + o_{d \to \infty}(1)) \frac{\log d}{d}. \]

In comparison, remember that every graph with maximum degree $d$ has an independent set of size at least $\frac{n}{d+1}$ (take a vertex, remove it and its neighbors, and repeat). That means that we're gaining a factor of $\log d$ here by having $G$ be triangle-free.


Proof. Pick $I$ according to the hard-core model again, and let $v$ be a uniform random vertex of $V(G)$. Let $Y$ be the number of uncovered neighbors of $v$, so we now also care about the neighbors of neighbors of $v$.

Just like last time, there are two different ways to write the occupancy fraction. Because we have a triangle-free graph, the neighbors of $v$ behave independently of each other, so the number of uncovered neighbors satisfies

\[ \alpha_G(\lambda) = \frac{\lambda}{1+\lambda} \mathbb{E}\left[ (1+\lambda)^{-Y} \right], \]

where the expected value term is the probability that none of the neighbors of $v$ are in $I$. By convexity, we can bound this:

\[ \alpha_G(\lambda) \ge \frac{\lambda}{1+\lambda} (1+\lambda)^{-\mathbb{E}[Y]}. \]

Now since $G$ has maximum degree $d$, which is similar to being $d$-regular, we also know that

\[ \alpha_G(\lambda) = \mathbb{E}[\Pr(v \text{ occupied})] \ge \frac{1}{d} \mathbb{E}\left[ \sum_{u \in N(v)} \Pr(u \text{ occupied}) \right] = \frac{1}{d} \cdot \frac{\lambda}{1+\lambda} \mathbb{E}[Y] \]
by linearity of expectation. The occupancy fraction is then at least the maximum of the two estimates:

\[ \alpha \ge \frac{\lambda}{1+\lambda} \max\left\{ (1+\lambda)^{-\mathbb{E}[Y]}, \frac{\mathbb{E}[Y]}{d} \right\}. \]

Note that one of these expressions decreases with $\mathbb{E}[Y]$ and the other increases. That means that we can find a bound that does not depend on $\mathbb{E}[Y]$:

\[ \alpha \ge \frac{\lambda}{1+\lambda} \min_{y \ge 0} \max\left\{ (1+\lambda)^{-y}, \frac{y}{d} \right\}, \]

and after some optimization, this indeed yields the result we want.
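The min-max above is attained where the decreasing curve $(1+\lambda)^{-y}$ crosses the increasing line $y/d$. As a rough numerical illustration (not the optimization carried out in the actual proof), the sketch below finds that crossing by bisection and evaluates the resulting lower bound; the function names and the iteration count are arbitrary choices.

```python
import math


def crossing_point(d, lam):
    """Solve (1 + lam)^(-y) = y / d for y in [0, d] by bisection (requires lam > 0)."""
    lo, hi = 0.0, float(d)
    for _ in range(200):
        mid = (lo + hi) / 2
        if (1 + lam) ** (-mid) > mid / d:
            lo = mid  # decreasing curve still above the line: crossing is to the right
        else:
            hi = mid
    return (lo + hi) / 2


def occupancy_lower_bound(d, lam):
    """lam / (1 + lam) times the min-max value, which equals y* / d at the crossing y*."""
    return lam / (1 + lam) * crossing_point(d, lam) / d


for d in (10, 100, 1000, 10000):
    bound = occupancy_lower_bound(d, 1.0)
    # Print the bound next to log(d) / d and the trivial 1 / (d + 1) for comparison.
    print(d, bound, math.log(d) / d, 1 / (d + 1))
```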


What consequences does this have? First of all, there are other situations where similar techniques and theorems apply: in fact, we saw one earlier when we were talking about dense sphere-packings in high dimensions. The hard-core model models non-overlapping spheres, and we can set up problems in similar ways where we draw a sphere packing according to some distribution to find the expected fraction of space taken up. Doing the calculations, we found a sphere-packing density of $n 2^{-n}$ in $\mathbb{R}^n$. This is close to the best we know for almost all $n$, and it's definitely better than the $2^{-n}$ that we got with a greedy packing. (Notice that we have the same characteristic log term as in our graph theory problem.)

Similarly, we can pack spherical caps on a sphere, which is like saying that we want points on a sphere that are 
pairwise separated by some angle. The most prominent case of this is called the kissing problem, which asks about 
the maximum number of unit balls that are nonoverlapping but all touch a central unit ball. This problem is interesting 


even in 3 dimensions! 


11.4 A useful corollary: Ramsey numbers 


We can translate the occupancy fraction statement directly into a graph theoretic statement:

Corollary 11.8
Every triangle-free graph on $n$ vertices with max degree at most $d$ contains an independent set of size at least $(1 + o_{d \to \infty}(1)) \frac{\log d}{d} n$.

This actually gives us a bound for the Ramsey numbers:

Corollary 11.9
We have
\[ R(3, k) \le (1 + o(1)) \frac{k^2}{\log k}. \]
In other words, there exists an $n \sim \frac{k^2}{\log k}$ such that every graph on $n$ vertices has either a triangle or an independent set of size $k$.

Proof. If our graph is triangle-free, every neighborhood is an independent set. So if any vertex has degree at least $k$, we automatically have an independent set of the desired size. Otherwise, the maximum degree is less than $k$, so by the corollary, there exists an independent set of size at least $(1 + o(1)) \frac{\log k}{k} n$, and choosing $n = (1 + o(1)) \frac{k^2}{\log k}$ provides us with the independent set of desired size.
This is essentially the best upper bound we have, and we also know a pretty close lower bound:

\[ R(3, k) \ge \left( \frac{1}{4} + o(1) \right) \frac{k^2}{\log k}. \]

To construct a graph with this many vertices, remember that we found lower bounds to $R(k, k)$ by taking a random graph. Use a similar philosophy here: construct our graphs randomly, but use a triangle-free process. Start with an empty graph on $n$ vertices, and keep adding uniform random edges subject to the constraint “don't make triangles.”

Remark. In contrast, we don't even know the order of magnitude for $R(4, k)$ yet.


11.5 Back to independent sets 


We found an upper bound on $i(G)^{1/v(G)}$ earlier on: can we find a way to minimize this quantity? Which graphs have the minimum number of independent sets?
Remark. Two different students guessed “$K_{d+1}$” and “random.”

It turns out that as stated, among all $d$-regular graphs $G$, the minimizer is $K_{d+1}$. However, if we restrict ourselves to bipartite $d$-regular graphs $G$, the answer becomes “random” or “a $d$-regular infinite tree.”

One way to think of this is to consider a “2-lift” $G'$: take two copies of $G$, and replace edges with their “crossed” versions. Then we have $i(G') \le i(G)^2$, and now by repeatedly lifting to destroy small cycles in the graph, we find that the number of independent sets, normalized, approaches some constant:
\[ i(G_n)^{1/v(G_n)} \to c_d \]

if the girth of $G_n$ goes to infinity. In other words, we have a “tree-like” graph!


Let's modify the question so that neither of these answers is allowed: 


Problem 11.10 
Let $G$ be a 3-regular graph. How do we maximize $i(G)^{1/v(G)}$ if we're not allowed to have 4-cycles? Similarly, how do we minimize this quantity if our graph is triangle-free?

Theorem 11.11 (Perarnau-Perkins, 2018)
Among 3-regular graphs $G$ without cycles of length 4, $i(G)^{1/v(G)}$ is minimized by the Petersen graph and maximized by the Heawood graph.

Here are the Petersen and Heawood graphs, respectively:


These are Moore graphs - they are essentially the smallest graphs that are d-regular with a specific girth condition. 
The idea here is that we can throw girth conditions into our linear program, because the additional constraints are 


local. 
11.6 Proper colorings in graphs

Let's look at one more example: let $c_q(G)$ be the number of proper $q$-colorings of our graph $G$. Recall that we want to maximize $c_q(G)^{1/v(G)}$ across $d$-regular graphs $G$: this can be done using the Potts model in statistical physics. Basically, sample a coloring $c : V \to [q]$, not necessarily proper, where a coloring occurs with probability proportional to $\beta^{m(c)}$, where $m(c)$ is the number of monochromatic edges. (Note that the parameter $\beta$ acts sort of like the parameter $\lambda$, and in the end, we can set $\beta = 0$. Notably, $\beta = 1$ gives a uniform coloring.)

So now we have a partition function for the Potts model

\[ Z_G^q(\beta) = \sum_{c : V \to [q]} \beta^{m(c)}; \]

note now that the log derivative
\[ U_G^q(\beta) = \frac{\beta}{e(G)} \frac{d}{d\beta} \log Z_G^q(\beta) = \frac{\mathbb{E}[m(c)]}{e(G)} \]

is the expected fraction of monochromatic edges. In physics, this is known as the “internal energy.”

Theorem 11.12
For all 3-regular graphs $G$, $q \ge 2$, and $0 \le \beta \le 1$,

\[ U_G^q(\beta) \ge U_{K_{3,3}}^q(\beta). \]

Integrating this (the inequality is flipped because we integrate from 1 down to $\beta < 1$), we get
\[ c_q(G)^{1/v(G)} \le c_q(K_{3,3})^{1/6}. \]

Proving this requires a similar kind of trick of constructing the two-part experiment and finding linear constraints. However, this has lots of variables, one for each possible configuration. We don't actually know how to do this by hand, but we plug this into a computer, and indeed $K_{3,3}$ is the maximizer! Unfortunately, we don't know how to get this to work for $d$-regular graphs in general: the computation time is far too large.
12 A teaser for “Graph Theory and Additive Combinatorics”


For this last lecture, titled “triangles and equations,” we're going to be previewing 18.217, “Graph Theory and Additive 


Combinatorics,” being taught next fall. (This was a class Professor Zhao taught in Fall 2017 as well!) 


12.1 A glance at Fermat’s last theorem 
Like many other mathematicians of the time, Schur thought about Fermat's Last Theorem, which asks for solutions to
\[ X^n + Y^n = Z^n, \quad n \ge 3. \]

He considered the following idea: why not reduce this mod $p$? If we can show that for infinitely many different primes $p$ there are no nontrivial solutions mod $p$, then the equation must not have any nontrivial solutions in the integers. Unfortunately, this doesn't work, and he proved it doesn't work. Instead, we got the following result:

Theorem 12.1 (FLT mod p)

For all $n$, the equation

\[ X^n + Y^n \equiv Z^n \pmod{p} \]

has a nontrivial solution (where $p$ does not divide $XYZ$) for all sufficiently large primes $p$.


In fact, Schur proved a more combinatorial Ramsey-type result: 


Theorem 12.2 (Schur) 
For all $r$, there exists an integer $N$ such that if we color $\{1, \dots, N\}$ with $r$ colors, then there exists a monochromatic solution to $X + Y = Z$.


The modern way to view this is that we can reduce to Ramsey's theorem. 


Proof. Given a coloring $\phi : [N] \to [r]$, we color a complete graph $K_{N+1}$ with vertices $[N+1]$ by coloring an edge $(i, j)$ with the color $\phi(|i - j|)$. By Ramsey's theorem, if $N$ is sufficiently large, there exists a monochromatic triangle (think of this as using the pigeonhole principle). Then if those vertices are $i < j < k$, then

\[ \phi(k - i), \quad \phi(k - j), \quad \phi(j - i) \]

are all the same color, and now we've found a monochromatic solution to $X + Y = Z$: let $x = j - i$, $y = k - j$, $z = k - i$.
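For a very small number of colors, Schur's theorem can be confirmed by exhaustive search. The sketch below (with made-up function names) allows $x = y$, matching the proof above, where $j - i$ and $k - j$ may coincide. It only tries $r = 2$, since the search space $r^N$ grows quickly.

```python
from itertools import product


def has_monochromatic_schur_triple(coloring):
    """coloring[i - 1] is the color of i; look for x + y = z (x = y allowed) in one color."""
    n = len(coloring)
    for x in range(1, n + 1):
        for y in range(x, n - x + 1):  # ensures x <= y and x + y <= n
            if coloring[x - 1] == coloring[y - 1] == coloring[x + y - 1]:
                return True
    return False


def every_coloring_has_triple(n, r):
    """True if every r-coloring of {1, ..., n} has a monochromatic x + y = z."""
    return all(
        has_monochromatic_schur_triple(c) for c in product(range(r), repeat=n)
    )


for n in range(1, 7):
    print(n, every_coloring_has_triple(n, 2))
```

With two colors, the coloring $\{1, 4\}$, $\{2, 3\}$ avoids monochromatic solutions on $\{1, \dots, 4\}$, while no coloring of $\{1, \dots, 5\}$ does.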


So this is a connection between a number-theoretic problem and the corresponding graph-theory problem. 


Fact 12.3 

This implies FLT mod $p$: let $H$ be the subgroup of $(\mathbb{Z}/p\mathbb{Z})^\times$ consisting of $n$th powers $\{a^n : a \in (\mathbb{Z}/p\mathbb{Z})^\times\}$. Then partition the numbers $\{1, 2, \dots, p - 1\}$ into $H$-cosets: this uses at most $n$ colors. If $p$ is sufficiently large in terms of $n$, Schur's theorem tells us that there exists some coset with a monochromatic solution: if the coset is $aH$, then

\[ aX^n + aY^n \equiv aZ^n \pmod{p}; \]

multiply by the multiplicative inverse of $a$ to get the result.

This is a “baby example” of what is called additive combinatorics, which also goes under the name of “combinatorial number theory.” Usually, number theory is about multiplying primes together: here, we care more about combinatorial properties of the numbers. Let's do a few more examples.


12.2 Turán's theorem and more

The question here is to find the maximum number of edges in an $n$-vertex triangle-free graph. The answer is that we want a complete bipartite graph $K_{\lfloor n/2 \rfloor, \lceil n/2 \rceil}$.

Well, the analogous problem in additive combinatorics is to find the maximum size of a subset of $[N]$ without a solution to $x + y = z$. Picking the odd numbers, or picking the numbers greater than $\frac{N}{2}$, gives half (rounded up) of the total set. We can't do better, because we can let $M$ be the maximum element in our subset: then $x$ and $M - x$ can't both appear in our set, so we can have at most half density.


Here’s a way to make that question a bit harder: 


Problem 12.4 


What's the maximum size of a subset of $[N]$ without a solution to $x + y = 2z$ (where $x, y, z$ aren't all equal)? In other words, we want a set with no three-term arithmetic progression.

This problem is hard to answer, and it's even hard to find good guesses. One thing we can do is try this greedily: add 1 and 2, and then successively add numbers if they don't create 3-term arithmetic progressions. This gives exactly the numbers $n$ for which $n - 1$ has only the digits 0 and 1 in its base-3 representation, and the resulting set has size about $N^{\log_3 2}$, since we pick $2^k$ of the first $3^k$ positive integers.
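The greedy construction is easy to run. The sketch below (function names are illustrative) builds the greedy set and compares it with the base-3 description, using the convention that $n$ is selected when $n - 1$ has no digit 2 in base 3.

```python
def greedy_3ap_free(n_max):
    """Greedily add 1, 2, 3, ... whenever no 3-term arithmetic progression is created."""
    chosen = []
    chosen_set = set()
    for n in range(1, n_max + 1):
        # n is larger than everything chosen so far, so it could only be the last
        # term of a progression a < b < n, which forces a = 2b - n.
        if not any((2 * b - n) in chosen_set for b in chosen):
            chosen.append(n)
            chosen_set.add(n)
    return chosen


def has_only_digits_0_and_1_in_base_3(m):
    """True if the base-3 expansion of m >= 0 contains no digit 2."""
    while m:
        if m % 3 == 2:
            return False
        m //= 3
    return True


greedy = greedy_3ap_free(81)
by_digits = [n for n in range(1, 82) if has_only_digits_0_and_1_in_base_3(n - 1)]
print(greedy[:8])
print(greedy == by_digits, len(greedy_3ap_free(27)), len(greedy))
```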

This is nowhere near the best, though: there exists an $N^{1-o(1)}$ construction, which we won't discuss here. On the other hand, due to Roth, we also know that the size is sublinear: we cannot get a positive proportion of $[N]$.
\[ N^{1-o(1)} \le \text{size of subset} \le o(N). \]


This is probably Roth's second most important result, and it’s driven a lot of research in additive combinatorics. 


But now, what's the analogous question for the 3-term AP problem in graph theory? 


Problem 12.5 


What's the maximum number of edges in an $n$-vertex graph, where every edge is contained in a unique triangle?

Analogously, the answer here is

\[ n^{2-o(1)} \le \text{number of edges} \le o(n^2). \]
Let's show the lower bound:

Proposition 12.6
We can find a graph with $n^{2-o(1)}$ edges where every edge is contained in a unique triangle.


Solution. Let's say $A \subseteq \mathbb{Z}/N\mathbb{Z}$ is a subset without three-term arithmetic progressions, where $N$ is odd (just to avoid technicalities). Then we can actually construct a graph where every edge is contained in a unique triangle: we have three vertex sets, $X = Y = Z = \mathbb{Z}/N\mathbb{Z}$, and we put an edge between $x \in X$ and $y \in Y$ if $y - x \in A$, an edge between $y \in Y$ and $z \in Z$ if $z - y \in A$, and an edge between $x \in X$ and $z \in Z$ if $\frac{z - x}{2} \in A$.

What are the triangles in this graph? Note that $y - x$, $\frac{z - x}{2}$, $z - y$ form an arithmetic progression, but $A$ doesn't have any 3-APs except for the trivial ones: $a, a, a$. So in any triangle all three differences are equal, which means each edge determines its third vertex, and every edge here lies in exactly one triangle! The graph has $n = 3N$ vertices and $3N|A|$ edges, so taking $|A| = N^{1-o(1)}$ gives $n^{2-o(1)}$ edges. This same construction also proves that the upper bound of the graph theory version implies the upper bound in the AP-subset problem.
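To see the construction in action, here is a small sketch that builds the tripartite graph for one choice of $N$ and $A$ and counts, by brute force, how many triangles contain each edge. The choice $N = 9$, $A = \{0, 1, 3\}$ and all function names are illustrative assumptions.

```python
from itertools import product


def is_ap_free_mod(N, A):
    """True if x + y = 2z (mod N) has only the trivial solutions x = y = z in A."""
    return all(
        (x + y - 2 * z) % N != 0 or x == y == z
        for x, y, z in product(A, repeat=3)
    )


def build_graph(N, A):
    """Edges of the tripartite graph on X, Y, Z = Z/NZ, for odd N."""
    half = (N + 1) // 2  # inverse of 2 modulo an odd N
    edges = set()
    for u, v in product(range(N), repeat=2):
        if (v - u) % N in A:
            edges.add((("X", u), ("Y", v)))
            edges.add((("Y", u), ("Z", v)))
        if ((v - u) * half) % N in A:
            edges.add((("X", u), ("Z", v)))
    return edges


def triangles_per_edge(N, edges):
    """Map each edge to the number of triangles (x, y, z) containing it."""
    counts = {e: 0 for e in edges}
    for x, y, z in product(range(N), repeat=3):
        tri = [(("X", x), ("Y", y)), (("Y", y), ("Z", z)), (("X", x), ("Z", z))]
        if all(e in edges for e in tri):
            for e in tri:
                counts[e] += 1
    return counts


N, A = 9, {0, 1, 3}
edges = build_graph(N, A)
counts = triangles_per_edge(N, edges)
print(is_ap_free_mod(N, A), len(edges), set(counts.values()))
```

The brute-force count is only feasible for tiny $N$, but it confirms the two facts used in the argument: there are $3N|A|$ edges, and each one sits in exactly one triangle.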


We haven't discussed how to prove any of the bounds, but we'll do that in the course next semester. A lot of 


interesting tools are used to achieve this, and generalizations and extensions have blossomed into a new field. 


12.3 A generalization: more modern approaches


Theorem 12.7 (van der Waerden) 


For all r and k, there exists N such that if [N] is colored with r colors, then there is a monochromatic k-term 


arithmetic progression. 
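For very small parameters the theorem can be checked exhaustively. The sketch below tries every coloring of [n] with r colors and looks for a monochromatic k-term progression; it is a brute-force illustration for the case r = 2, k = 3 (where the smallest valid N is the well-known value 9), and the function names are hypothetical.

```python
from itertools import product


def has_mono_ap(coloring, k):
    """coloring[i] is the color of i + 1; look for a monochromatic k-term AP."""
    n = len(coloring)
    for start in range(n):
        for step in range(1, n):
            if start + (k - 1) * step >= n:
                break  # larger steps run off the end as well
            if len({coloring[start + j * step] for j in range(k)}) == 1:
                return True
    return False


def every_coloring_has_ap(n, r, k):
    """True if all r-colorings of [n] contain a monochromatic k-term AP."""
    return all(has_mono_ap(c, k) for c in product(range(r), repeat=n))


print(every_coloring_has_ap(8, 2, 3), every_coloring_has_ap(9, 2, 3))
```

The search space has $r^n$ colorings, so this approach stops being practical almost immediately; it is meant only to make the statement concrete.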


Erdős and Turán believed that the real reason for van der Waerden’s theorem is not that we use r colors, but that one of our color classes has positive density. This led them to a conjecture in 1936 that was only resolved by Szemerédi in 1975, resulting in the following landmark theorem:


Theorem 12.8 


For every $\delta > 0$ and every $k$, there exists $N$ such that every subset $A \subseteq [N]$ with $|A| \ge \delta N$ contains a k-term arithmetic progression.


The proof is difficult and involved enough that we won't even prove it next semester. But this theorem has been 
looked at from other directions, and this has led to some success: the results can also be shown with ergodic theory, 
and this turns out to be more general in some sense. In addition, a Fourier analytic approach (by Roth) also works, but 
it doesn't work for 4-term arithmetic progressions. (We may have also heard of the “Hardy-Littlewood circle method.”) 


Recently, a newer approach was found that generalizes Roth's proof to “higher-order Fourier analysis.” 


Remark. Normal Fourier analysis considers correlations of a function with an exponential phase

$$\mathbb{E}_x\left[f(x)\, e^{2\pi i r x / N}\right].$$

In contrast, quadratic Fourier analysis looks at correlations with quadratic exponential phases:

$$\mathbb{E}_x\left[f(x)\, e^{2\pi i r x^2 / N}\right],$$

and these turn out to be essential when studying four-term arithmetic progressions.


12.4 A principle about approaching complicated problems 


One last idea that developed out of Szemerédi’s theorem is the “regularity lemma.” Each of these approaches has its own tools, but overall, there are some connections. The idea here is structure versus randomness, or signal versus noise: a system should be able to be written down as a piece that is structured, plus a “pseudo-random piece.”

Example 12.9
If we want to understand 3-term arithmetic progressions in [N], we may want to instead consider functions $f : \mathbb{Z}/N\mathbb{Z} \to \mathbb{R}$. These can be written via the Fourier inversion formula

$$f(x) = \sum_{r} \hat{f}(r)\, \omega^{rx},$$

where $\omega$ is a primitive $N$th root of unity. The coefficients $\hat{f}(r) = \mathbb{E}_x[f(x)\, \omega^{-rx}]$ are generally not large (in some sense), so we can write out our sum as a sum of parts where $|\hat{f}(r)|$ is large (structured, few of them) and where $|\hat{f}(r)|$ is small (looks pseudorandom).
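As a rough illustration of this splitting, the sketch below computes the Fourier coefficients of a function on $\mathbb{Z}/N\mathbb{Z}$ directly from the definition and separates the frequencies by a size threshold. The test function (the indicator of the even residues modulo 12 plus one stray point), the threshold 0.25, and the function names are all assumptions made for the example.

```python
import cmath


def phase(k, N):
    """omega^k for omega = exp(2*pi*i/N)."""
    return cmath.exp(2j * cmath.pi * (k % N) / N)


def fourier_coefficients(f):
    """hat f(r) = E_x[f(x) * omega^(-r*x)] for f given as a list of length N."""
    N = len(f)
    return [sum(f[x] * phase(-r * x, N) for x in range(N)) / N for r in range(N)]


def split_by_coefficient_size(f, threshold):
    """Split f into the part carried by large coefficients and the remainder."""
    N = len(f)
    coeffs = fourier_coefficients(f)
    large = [r for r in range(N) if abs(coeffs[r]) >= threshold]
    # For real f the large frequencies come in pairs r, -r, so this sum is
    # real up to rounding error.
    structured = [
        sum(coeffs[r] * phase(r * x, N) for r in large).real for x in range(N)
    ]
    remainder = [f[x] - structured[x] for x in range(N)]
    return large, structured, remainder


N = 12
f = [1.0 if x % 2 == 0 else 0.0 for x in range(N)]  # indicator of even residues
f[1] = 1.0                                           # plus one stray point
large, structured, remainder = split_by_coefficient_size(f, threshold=0.25)
print(large)
```

Here only two frequencies survive the threshold, and the structured part is a function of the parity of x alone; the remainder has all of its Fourier coefficients below the threshold.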


Example 12.10 


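If we start with a symmetric matrix $A \in \mathbb{R}^{n \times n}$, we can decompose it in terms of its spectrum:

$$A = \sum_{i} \lambda_i v_i v_i^{T},$$

where the $\lambda_i$ are the eigenvalues of $A$ and the $v_i$ are corresponding orthonormal eigenvectors.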


We can again separate this into terms where the eigenvalues are large versus small. 


For graph theory, there's a similar notion: we can always start by representing a graph by its adjacency matrix, but there are also more combinatorial ways to do this. This leads us to a powerful tool in graph theory:


Theorem 12.11 (Szemerédi’s regularity lemma) 


Informally, every graph can be decomposed into a bounded number of vertex parts (in terms of some error), so 


that almost all pairs of parts look “pseudorandom.” 


In other words, we can think of two vertex parts as being essentially defined by edge densities. Then the “structure” 
part of this is the density we assign, and the “randomness” is the rest of the graph. 


This is a powerful tool: it actually allows us to prove the $o(n^2)$ result for Problem 12.5.


12.5 Graph limits 


Let's say we have a sequence of graphs $G_1, G_2, \dots$ and ask the question “do these graphs converge?” If we say that $G_n = G(n, \frac{1}{2})$, we can say that the sequence converges to a limit, which is the constant function $\frac{1}{2}$. (In other words, we don't care about the specific edges, but only global macroscopic pictures.)


Definition 12.12 


A sequence of graphs $(G_n)$ converges if for all graphs $F$, the density $t(F, G_n)$ converges to some constant $c_F$ as $n \to \infty$.


This is a very local property, but how exactly do we represent this convergence? 


Definition 12.13 


A graphon is a symmetric, measurable function $W : [0, 1]^2 \to [0, 1]$.

We can think of our graphs as adjacency matrices: they'll have a bunch of 0s and 1s. Think of the 1s as black squares and the 0s as white squares: as $n \to \infty$ and our eyesight becomes poor, we see a grayscale image.

This isn’t quite correct yet: for example, what's the limit of $K_{n/2,n/2}$? This can either look like a blown-up version of $\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$, or it can look like a checkerboard! The latter begins to look a lot like $\frac{1}{2}$, but the former is actually the correct answer. So there are some subtleties that we're skipping over.

Here’s some more motivation: let’s say I want to maximize $x - x^3$ for $x \in [0, 1]$, but I only know about rational numbers. We can still state the answer, but it’s a lot more contrived: since the maximizer $1/\sqrt{3}$ is irrational, we have to take some sequence of rationals to get to the answer.

Well, instead, let's say we want to minimize the 4-cycle density in a graph with edge density $\frac{1}{2}$. This is similar to the path of length 3 problem: the answer is to take a sequence of pseudorandom graphs with edge density $\frac{1}{2}$ and number of vertices going to $\infty$. That’s kind of annoying to say, though: is there a nicer way to state the limit? The beauty of using graph limits is that we can just state our answer as $W = \frac{1}{2}$.
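To make the 4-cycle example concrete, the sketch below computes densities exactly for a small graph. It assumes that $t(F, G)$ means the homomorphism density, that is, the fraction of all maps $V(F) \to V(G)$ that send every edge of $F$ to an edge of $G$; this section does not restate the definition, so treat that as an assumption. The helper names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def hom_density(F_edges, k, G_adj):
    """Fraction of maps {0, ..., k-1} -> V(G) sending each edge of F to an edge of G."""
    n = len(G_adj)
    good = sum(
        all(G_adj[phi[u]][phi[v]] for u, v in F_edges)
        for phi in product(range(n), repeat=k)
    )
    return Fraction(good, n ** k)


def complete_bipartite(m):
    """Adjacency matrix of K_{m,m} on vertices 0, ..., 2m - 1."""
    n = 2 * m
    return [[int((i < m) != (j < m)) for j in range(n)] for i in range(n)]


EDGE = [(0, 1)]
C4 = [(0, 1), (1, 2), (2, 3), (3, 0)]

G = complete_bipartite(3)
edge_density = hom_density(EDGE, 2, G)
c4_density = hom_density(C4, 4, G)
print(edge_density, c4_density, edge_density ** 4)
```

The complete bipartite graph has edge density $\frac{1}{2}$ but a 4-cycle density strictly above $(\frac{1}{2})^4$, the value attained by the constant graphon $W = \frac{1}{2}$, so it is not a minimizer.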

Proving these things exist requires Szemerédi's regularity lemma to represent graphs: this allows us to view graphs with this structure versus randomness decomposition. It's a nice fact, by the way, that every sequence of graphs contains a subsequence that converges to a graph limit.


12.6 A few open problems 


In 18.217, we'll discuss the structure of set addition: let $A + A$ be the set of all numbers $\{a + b : a, b \in A\}$. We can ask questions like “what is the size of $A + A$ if $|A| = n$?” In the integers, the minimum is attained for $A = [n]$, and the maximum is attained with random large numbers: this gives

$$2n - 1 \le |A + A| \le \binom{n+1}{2}.$$
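Both bounds can be checked exhaustively for small sets. The sketch below runs over all 4-element subsets of $\{1, \dots, 12\}$ and records the smallest and largest sumset sizes; the choice of $n = 4$ and of the universe is arbitrary, and the name `sumset` is just an illustrative helper.

```python
from itertools import combinations


def sumset(A):
    """The set A + A = {a + b : a, b in A}."""
    return {a + b for a in A for b in A}


n = 4
universe = range(1, 13)
sizes = {A: len(sumset(A)) for A in combinations(universe, n)}
smallest, largest = min(sizes.values()), max(sizes.values())
print(smallest, largest)
print(len(sumset((1, 2, 3, 4))), len(sumset((1, 2, 5, 7))))
```

The interval $\{1, 2, 3, 4\}$ attains the minimum $2n - 1$, while a set such as $\{1, 2, 5, 7\}$, whose pairwise sums are all distinct, attains the maximum $\binom{n+1}{2}$.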


But now, what can we say about $A$ if $A + A$ is small? For example, are there any properties that we know if $|A + A| \le 100|A|$?
This means our set is not too random, or else we'd have quadratically many pairwise sums. So there’s some arithmetic structure in $A$:


Theorem 12.14 (Freiman) 


$A$ must be contained in a “small” generalized arithmetic progression; that is, a set of numbers of the form $a + c_1 d_1 + c_2 d_2 + \cdots + c_k d_k$.


But there are still open problems around this theorem. In particular, the following is considered one of the most important conjectures in the field:


Conjecture 12.15 (Polynomial Freiman-Ruzsa conjecture) 
There are two equivalent forms of this: we'll only state this over $\mathbb{F}_2^n$.
• Let $A \subseteq \mathbb{F}_2^n$ be a subset with small doubling: $|A + A| \le K|A|$. Then there exists a subspace $V$ with $|V| \le |A|$ such that $|V \cap A| \ge K^{-O(1)}|A|$.
• Given a function $f : \mathbb{F}_2^n \to \mathbb{F}_2^n$ which is “almost linear” (meaning $f(x + y) - f(x) - f(y)$ takes on at most $K$ values), there exists a linear function $g : \mathbb{F}_2^n \to \mathbb{F}_2^n$ such that $f(x) - g(x)$ takes on at most $K^{O(1)}$ different values.

In the second statement, it’s easy to show that we can get $2^K$: we just have $g$ agree with $f$ on a basis! Then all errors need to lie in the subspace spanned by the values of $f(x + y) - f(x) - f(y)$. The best result that is known is a quasi-polynomial bound: we have $e^{(\log K)^{O(1)}}$.

Finally, let's consider both addition and multiplication together: much like we define $A + A = \{a + b : a, b \in A\}$, define $AA = \{ab : a, b \in A\}$. Then $|A + A|$ and $|AA|$ can separately be made linear, but there’s a conjecture that this can't happen simultaneously:


Conjecture 12.16 
Suppose $|A| = n$. Then

$$\max\{|A + A|, |AA|\} \ge n^{2 - o(1)}.$$


Recent improvements have gotten us from $n^{4/3 - o(1)}$ to $n^{4/3 + c}$ for a small constant $c > 0$, which is still very far from what we think is the truth. This is another example of the connections between graph theory and additive combinatorics: earlier on, we saw the Szemerédi-Trotter theorem about incidences between points and lines, and it turns out we can connect the earlier material here as well. The idea is that slopes of lines involve both addition and multiplication, so encoding that information into this problem here allows us to use point-line incidences to deduce results about sums and products.
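The tension between sums and products already shows up in the two most natural examples. The sketch below compares an arithmetic progression with a geometric progression of the same size; the sets and the helper names `sumset` and `product_set` are illustrative assumptions.

```python
def sumset(A):
    """The set A + A = {a + b : a, b in A}."""
    return {a + b for a in A for b in A}


def product_set(A):
    """The set AA = {a * b : a, b in A}."""
    return {a * b for a in A for b in A}


n = 10
arithmetic = list(range(1, n + 1))      # small sumset
geometric = [2 ** i for i in range(n)]  # small product set

for name, A in [("arithmetic", arithmetic), ("geometric", geometric)]:
    print(name, len(sumset(A)), len(product_set(A)))
```

In each case one of the two sets has only $2n - 1$ elements, but the other is much larger, which is consistent with the conjecture that both cannot be small at once.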




